Enable OpenMP pragmas for gate kernels on Linux and MacOS Wheels #1133
Merged
Changes from 10 commits
29 commits
544a5c6 LQ_ENABLE_KERNEL_OMP=ON (maliasadi)
b2aef63 Auto update version from '0.42.0-dev1' to '0.42.0-dev2' (ringo-but-quantum)
c806c06 Merge with master (maliasadi)
c80ec76 Merge branch 'master' into tune_lq_kernel_perf (maliasadi)
53cc19e Merge branch 'master' into tune_lq_kernel_perf (maliasadi)
1c1faf7 Auto update version from '0.44.0-dev3' to '0.44.0-dev4' (ringo-but-quantum)
a081e86 Update changelog (maliasadi)
f881805 Merge with master (maliasadi)
05c6f27 Update docs (maliasadi)
b43df85 Auto update version from '0.44.0-dev4' to '0.44.0-dev5' (ringo-but-quantum)
f14b140 Apply suggestions from code review (maliasadi)
03c5ef4 Merge branch 'master' into tune_lq_kernel_perf (maliasadi)
6b71d5c Auto update version from '0.44.0-dev5' to '0.44.0-dev6' (ringo-but-quantum)
b916e50 Merge with master (maliasadi)
cf3a4d5 Update the scope (maliasadi)
c90f59e trigger ci (maliasadi)
4285358 Merge branch 'master' into tune_lq_kernel_perf (maliasadi)
d371ffc Apply suggestions from code review (maliasadi)
a2e6d48 Auto update version from '0.44.0-dev14' to '0.44.0-dev17' (ringo-but-quantum)
41b0967 Merge with master (maliasadi)
63532ac Enable Kernel OMP on Linux and MacOS wheels (maliasadi)
fc1d17d Update docs (maliasadi)
fe71c63 Auto update version from '0.44.0-dev26' to '0.44.0-dev28' (ringo-but-quantum)
03cf8e1 git mv kernel_tuning.rst (maliasadi)
808620c trigger ci (maliasadi)
99a1036 Update changelog (maliasadi)
f97486f Merge with master (maliasadi)
05604c6 Merge branch 'master' into tune_lq_kernel_perf (maliasadi)
a3dbc25 Auto update version from '0.45.0-dev3' to '0.45.0-dev4' (ringo-but-quantum)
17 changes: 13 additions & 4 deletions
doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,13 +1,22 @@ | ||
| Kernel performance tuning | ||
| ######################### | ||
|
|
||
| Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads. To enable this, we add OpenMP threading within the adjoint differentiation method implementation and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload. | ||
| Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads. | ||
|
||
| To enable this, we add OpenMP threading within the adjoint differentiation method implementation | ||
| and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload. | ||
|
|
||
| However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead. For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels. | ||
| However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead. | ||
| For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels. | ||
|
|
||
| OpenMP threaded kernels | ||
| ----------------------- | ||
|
|
||
| To enable OpenMP acceleration of the gate kernels, Lightning-Qubit can be compiled with the ``-DLQ_ENABLE_KERNEL_OMP=ON`` CMake flag. Not, that for gradient workloads with many observables, this may reduce performance in comparison with the default mode, so this behaviour is opt-in only. | ||
| OpenMP acceleration of the gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default in Lightning-Qubit. | ||
| You can control the number of threads used by setting the ``OMP_NUM_THREADS`` environment variable before starting your Python session, | ||
| or if already started, before simulating your PennyLane programs. | ||
| For gradient workloads with many observables, this may reduce performance compared with non-threaded gate kernels; | ||
| to turn this off, use the CMake flag ``-DLQ_ENABLE_KERNEL_OMP=OFF`` when building Lightning-Qubit. | ||
|
||
|
|
||
| For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag. This forces the data to avoid updating the CPU cache and can improve performance for larger workloads. | ||
| For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, | ||
| and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag. | ||
| This forces the data to avoid updating the CPU cache and can improve performance for larger workloads. | ||
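
The thread-count control described in the added documentation text can be sketched as follows. This is a minimal illustration, not code from the PR: `set_omp_threads` is a hypothetical helper, and the PennyLane lines are left commented out because they assume PennyLane and Lightning-Qubit are installed. OpenMP runtimes commonly read `OMP_NUM_THREADS` when they initialize, so setting it as early as possible (ideally before the import) is the safer choice.

```python
import os


def set_omp_threads(n: int) -> str:
    """Hypothetical helper: set OMP_NUM_THREADS for this process and return it."""
    if n < 1:
        raise ValueError("thread count must be a positive integer")
    os.environ["OMP_NUM_THREADS"] = str(n)
    return os.environ["OMP_NUM_THREADS"]


# Set the thread count before any simulation runs.
threads = set_omp_threads(4)
print(f"OMP_NUM_THREADS={threads}")

# Assumption: PennyLane with the Lightning-Qubit device is installed.
# import pennylane as qml
# dev = qml.device("lightning.qubit", wires=2)
```

The same effect can be had from the shell by exporting `OMP_NUM_THREADS` before launching Python, which avoids any question of whether the OpenMP runtime has already been initialized.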
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -16,4 +16,4 @@ | ||
| Version number (major.minor.patch[-label]) | ||
| """ | ||
|
|
||
| __version__ = "0.44.0-dev4" | ||
| __version__ = "0.44.0-dev5" | ||