optimize enqueueImpKernel #20668
Conversation
Force-pushed from a272143 to 9af80f1.
There is one UT failing, which I will fix tomorrow, but I think this PR can be reviewed right now.
      return MUsesAssert;
    }

    const std::optional<int> &getImplicitLocalArgPos() {
Should this method be marked const?
I understand that you just moved the implementation from the .cpp to the .hpp file, but maybe not marking this method as const was a mistake in the first place?
OK, fixed. I also removed the unneeded const reference on the small return value.
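For context, a minimal sketch of what the accessor looks like after addressing this comment; the class name here is assumed for illustration (the real type lives in the SYCL runtime headers). The getter is const-qualified and the small std::optional&lt;int&gt; is returned by value rather than by const reference:

```cpp
#include <optional>

// Simplified stand-in for the class under review; only the accessor
// shape matters here.
class KernelData {
public:
  // const-qualified getter; std::optional<int> is cheap to copy, so it is
  // returned by value instead of by const reference.
  std::optional<int> getImplicitLocalArgPos() const {
    return MImplicitLocalArgPos;
  }

private:
  std::optional<int> MImplicitLocalArgPos;
};
```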
    #endif
    }

    GlobalHandler *&GlobalHandler::getInstancePtr() {
What if we keep this method in the .cpp file and move all the others to the .hpp? What would the impact on performance be?
My personal opinion is that code with a static variable declared inside the method looks clearer and more isolated.
> What if we keep this method in the .cpp file and move all the others to the .hpp? What would the impact on performance be?

I see getInstancePtr is used only in the .cpp, so it should also be fast there. But this function is not needed any more; I removed it in favor of just using the variable directly.

> My personal opinion is that code with a static variable declared inside the method looks clearer and more isolated.

You are right. However, on the critical path we should avoid this pattern, as it is slower than a global variable outside the function. See this SO answer. I worked on another project where we observed how much the static-variable-inside-a-method pattern can deteriorate performance.
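For illustration, a minimal sketch of the two patterns being compared here; GlobalHandler is reduced to an empty struct and the actual initialization in the runtime may differ:

```cpp
// Minimal sketch, not the real sycl GlobalHandler; it only illustrates the
// two access patterns discussed above.
struct GlobalHandler {
  // ... runtime-wide state would live here ...
};

// Pattern 1: function-local static with dynamic initialization. The object
// is created lazily on the first call, and with C++11 thread-safe statics
// the compiler may emit an "already initialized?" guard that gets checked
// on every call.
GlobalHandler *&getInstancePtrLocalStatic() {
  static GlobalHandler *Handler = new GlobalHandler();
  return Handler;
}

// Pattern 2: a namespace-scope pointer. It is initialized when the library
// is loaded, and each later access is a plain load with no guard check.
GlobalHandler *RTGlobalObjHandler = new GlobalHandler();

GlobalHandler *&getInstancePtrGlobal() { return RTGlobalObjHandler; }
```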
    /// \return true if the instance has not been deallocated yet.
    static bool isInstanceAlive();
    static GlobalHandler *&getInstancePtr() { return RTGlobalObjHandler; }
Why is it public now?
My mistake; anyway, I am removing this function.
I think there are two parts in this PR:
- NFC changes that move method implementations from .cpp to .hpp files.
- Optimization of the tracing path, which is a functional change.
I would like to see two PRs: the first with the NFC changes, and the second refactoring the tracing/config-reading paths. It would be easier to review, but what is more important to me is how much each PR contributes to the performance.
@sergey-semenov recently tried to enable LTO and there was almost no perf impact. That is why I am curious how much moving method implementations from .cpp to .hpp files improves perf.
Force-pushed from 9af80f1 to 4b7cc9f.
Two PRs created. Please review:
The impact on performance is best split by looking at the instruction plots, as the time plots have too much variance to reliably split the impact. The second dot of the short line shows the impact when we remove the traces optimization (leaving just the inlining optimizations).

Out-of-order example: old SYCL: 144927, new SYCL with inlining optimizations: 142057, new SYCL with both optimizations: 141817. In terms of overhead reduction over UR (117309), this is:

So inlining has the major effect. The positive effect of the tracing optimization is nevertheless clearly visible in the flamegraphs: the call to trace in ProgramManager::getOrCreateKernel originally took 2.5% of this function and now takes somewhere between 0 and 1.32% (closer to 0, I think).

In-order example: old SYCL: 133375, new SYCL with inlining optimizations: 130915, new SYCL with both optimizations: 130675. In terms of overhead reduction over UR (118028), this is:

I think the compiler can do more at compilation time if it sees the inlined function definitions in the one source file it analyses than later during LTO, when it only has intermediate representations.
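For reference, a small sketch showing how the overhead-over-UR percentages can be derived from the instruction counts above; the formula is an assumption, not quoted from the PR:

```cpp
#include <cstdio>

int main() {
  // overhead over UR = (SYCL instructions - UR instructions) / UR instructions
  auto OverheadPct = [](double Sycl, double Ur) { return (Sycl - Ur) / Ur * 100.0; };

  // Out-of-order example above; UR baseline = 117309 instructions.
  std::printf("old SYCL:           %.1f%%\n", OverheadPct(144927, 117309)); // ~23.5%
  std::printf("inlining only:      %.1f%%\n", OverheadPct(142057, 117309)); // ~21.1%
  std::printf("inlining + tracing: %.1f%%\n", OverheadPct(141817, 117309)); // ~20.9%
  return 0;
}
```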
@lslusarczyk, can we close this PR because #20681 and #20682 were created?
_[functional change]:_ **GlobalHandler instance moved** from a static variable declared inside a method to global scope. This **changes when the first object is created**: previously it was created only when getInstancePtr was called; now it happens at library load time. Users pay for the instance even if they never use the global handler. The motivation is performance: accessing a static variable declared inside a function can be slower than accessing a variable at global scope, as explained e.g. [here](https://stackoverflow.com/questions/52198322/is-access-to-a-static-function-variable-slower-than-access-to-a-global-variable/52202005#52202005).

_[non-functional change]:_ **various functions** existing on the hot submit-kernel path **get inlined**.

This PR is part of #20668, which was split into two. See the plots in that PR's description for performance comparisons.


Reduces the overhead of SYCL over L0 in almost all scenarios by up to 3% (relative to the L0 baseline), i.e. by 0-13% of the current overhead.
Multiple optimizations on the common path. The performance gain comes mainly from inlining the most frequently executed simple methods, such as traces and getters.
Performance improvements may be seen here:
https://intel.github.io/llvm/benchmarks/?runs=lslusarczyk_BMG_L0v2%2CBaseline_BMG_L0v2
They can be seen in most of the benchmarks, most significantly in CPU count, but also in the time plots.
Sample improvements (the dots on the right are the new results from this PR):

SubmitKernel, in order, using events, SYCL: from 15.04us to 14.62us (L0: 12.19us), i.e. overhead over L0 reduced from 23.3% to 19.9%.
SubmitKernel, out of order, with completion, using events, long kernel, CPU count, SYCL: from 159k instr to 155k instr (L0: 132.5k), i.e. overhead over L0 reduced from 20.0% to 17.6%.

SubmitKernel, in order, CPU count, SYCL: from 133.4k instr to 130.7k instr (L0: 118.0k), i.e. overhead over L0 reduced from 13.0% to 10.7%.

... and much more.
Flamegraph of optimized method before:

Flamegraph after:

Note especially how getOrCreateKernel became smaller, though it is not the only function that did.