optimize enqueueImpKernel #20668
Conversation
Force-pushed from a272143 to 9af80f1.
There is one UT failing, which I will fix tomorrow, but I think this PR can be reviewed right now.
      return MUsesAssert;
    }

    const std::optional<int> &getImplicitLocalArgPos() {
Should this method be marked const?
I understand that you just moved the implementation from the .cpp to the .hpp file, but maybe not marking this method as const was a mistake in the first place?
OK, fixed. I also removed the unneeded const reference on the small return value.
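For context, a minimal sketch of what the accessor looks like after addressing this comment; the class name here is assumed for illustration (the real type lives in the SYCL runtime headers). The getter is const-qualified and the small std::optional&lt;int&gt; is returned by value rather than by const reference:

```cpp
#include <optional>

// Simplified stand-in for the class under review; only the accessor
// shape matters here.
class KernelData {
public:
  // const-qualified getter; std::optional<int> is cheap to copy, so it is
  // returned by value instead of by const reference.
  std::optional<int> getImplicitLocalArgPos() const {
    return MImplicitLocalArgPos;
  }

private:
  std::optional<int> MImplicitLocalArgPos;
};
```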
    #endif
    }

    GlobalHandler *&GlobalHandler::getInstancePtr() {
What if we keep this method in the .cpp file and move all the others to the .hpp? What would the impact on performance be?
My personal opinion is that code with a static variable declared inside the method looks clearer and more isolated.
> What if we keep this method in the .cpp file and move all the others to the .hpp? What would the impact on performance be?

I see getInstancePtr is used only in the .cpp, so it should also be fast there. But this function is not needed any more; I removed it in favor of just using the variable directly.

> My personal opinion is that code with a static variable declared inside the method looks clearer and more isolated.

You are right. However, on the critical path we should avoid this pattern, as it is slower than a global variable outside the function. See this SO answer. I worked on another project where we observed how much the static-variable-inside-a-method pattern can deteriorate performance.
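For illustration, a minimal sketch of the two patterns being compared here; GlobalHandler is reduced to an empty struct and the actual initialization in the runtime may differ:

```cpp
// Minimal sketch, not the real sycl GlobalHandler; it only illustrates the
// two access patterns discussed above.
struct GlobalHandler {
  // ... runtime-wide state would live here ...
};

// Pattern 1: function-local static with dynamic initialization. The object
// is created lazily on the first call, and with C++11 thread-safe statics
// the compiler may emit an "already initialized?" guard that gets checked
// on every call.
GlobalHandler *&getInstancePtrLocalStatic() {
  static GlobalHandler *Handler = new GlobalHandler();
  return Handler;
}

// Pattern 2: a namespace-scope pointer. It is initialized when the library
// is loaded, and each later access is a plain load with no guard check.
GlobalHandler *RTGlobalObjHandler = new GlobalHandler();

GlobalHandler *&getInstancePtrGlobal() { return RTGlobalObjHandler; }
```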
    /// \return true if the instance has not been deallocated yet.
    static bool isInstanceAlive();
    static GlobalHandler *&getInstancePtr() { return RTGlobalObjHandler; }
Why is it public now?
My mistake; anyway, I am removing this function.
I think there are two parts in this PR:
- NFC changes that move method implementations from .cpp to .hpp files.
- Optimization of the tracing path, which is a functional change.
I would like to see two PRs: the first with the NFC changes, and the second refactoring the tracing/config-reading paths. It would be easier to review, but what is more important to me is how much each PR contributes to the performance.
@sergey-semenov recently tried to enable LTO and there was almost no perf impact. That is why I am curious how much moving method implementations from .cpp to .hpp files improves perf.
Force-pushed from 9af80f1 to 4b7cc9f.
Two PRs created. Please review:
The impact on performance is best split by looking at the instruction plots, as the time plots have too much variance to reliably split the impact. The second dot of the short line shows the impact when we remove the traces optimization (leaving just the inlining optimizations).

Out-of-order example: old SYCL: 144927, new SYCL with inlining optimizations: 142057, new SYCL with both optimizations: 141817. In terms of overhead reduction over UR (117309), this is:

So inlining has the major effect. The positive effect of the tracing optimization is nevertheless clearly visible in the flamegraphs: the call to trace in ProgramManager::getOrCreateKernel originally took 2.5% of this function and now takes somewhere between 0 and 1.32% (closer to 0, I think).

In-order example: old SYCL: 133375, new SYCL with inlining optimizations: 130915, new SYCL with both optimizations: 130675. In terms of overhead reduction over UR (118028), this is:

I think the compiler can do more at compilation time if it sees the inlined function definitions in the one source file it analyses than later during LTO, when it only has intermediate representations.
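For reference, a small sketch showing how the overhead-over-UR percentages can be derived from the instruction counts above; the formula is an assumption, not quoted from the PR:

```cpp
#include <cstdio>

int main() {
  // overhead over UR = (SYCL instructions - UR instructions) / UR instructions
  auto OverheadPct = [](double Sycl, double Ur) { return (Sycl - Ur) / Ur * 100.0; };

  // Out-of-order example above; UR baseline = 117309 instructions.
  std::printf("old SYCL:           %.1f%%\n", OverheadPct(144927, 117309)); // ~23.5%
  std::printf("inlining only:      %.1f%%\n", OverheadPct(142057, 117309)); // ~21.1%
  std::printf("inlining + tracing: %.1f%%\n", OverheadPct(141817, 117309)); // ~20.9%
  return 0;
}
```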
@lslusarczyk, can we close this PR because #20681 and #20682 were created?
_[functional change]:_ **GlobalHandler instance moved** from a static variable declared inside a method to global scope. This **changes when the first object is created**: previously it was created only when getInstancePtr was called; now it happens at library load time. Users pay for the instance even if they never use the global handler. The motivation is performance: accessing a static variable declared inside a function can be slower than accessing a variable at global scope, as explained e.g. [here](https://stackoverflow.com/questions/52198322/is-access-to-a-static-function-variable-slower-than-access-to-a-global-variable/52202005#52202005).

_[non-functional change]:_ **various functions** existing on the hot submit-kernel path **get inlined**.

This PR is part of #20668, which was split into two. See the plots in that PR's description for performance comparisons.


Reduces the overhead of SYCL over L0 in almost all scenarios by up to 3% (relative to the L0 baseline), i.e. by 0-13% of the current overhead.
Multiple optimizations on the common path. The performance gain comes mainly from inlining the most frequently executed simple methods, such as traces and getters.
Performance improvements may be seen here:
https://intel.github.io/llvm/benchmarks/?runs=lslusarczyk_BMG_L0v2%2CBaseline_BMG_L0v2
They can be seen in most of the benchmarks, most significantly in CPU count, but also in the time plots.
Sample improvements (the dots on the right are the new results from this PR):

SubmitKernel, in order, using events, SYCL: from 15.04us to 14.62us (L0: 12.19us), i.e. overhead over L0 reduced from 23.3% to 19.9%.
SubmitKernel, out of order, with completion, using events, long kernel, CPU count, SYCL: from 159k instr to 155k instr (L0: 132.5k), i.e. overhead over L0 reduced from 20.0% to 17.6%.

SubmitKernel, in order, CPU count, SYCL: from 133.4k instr to 130.7k instr (L0: 118.0k), i.e. overhead over L0 reduced from 13.0% to 10.7%.

... and much more.
Flamegraph of optimized method before:

Flamegraph after:

Note especially how getOrCreateKernel became smaller, though it is not the only function that did.