Merged
Changes from 10 commits
29 commits
544a5c6
LQ_ENABLE_KERNEL_OMP=ON
maliasadi Apr 17, 2025
b2aef63
Auto update version from '0.42.0-dev1' to '0.42.0-dev2'
ringo-but-quantum Apr 17, 2025
c806c06
Merge with master
maliasadi May 28, 2025
c80ec76
Merge branch 'master' into tune_lq_kernel_perf
maliasadi Oct 15, 2025
53cc19e
Merge branch 'master' into tune_lq_kernel_perf
maliasadi Oct 16, 2025
1c1faf7
Auto update version from '0.44.0-dev3' to '0.44.0-dev4'
ringo-but-quantum Oct 16, 2025
a081e86
Update changelog
maliasadi Oct 16, 2025
f881805
Merge with master
maliasadi Oct 16, 2025
05c6f27
Update docs
maliasadi Oct 16, 2025
b43df85
Auto update version from '0.44.0-dev4' to '0.44.0-dev5'
ringo-but-quantum Oct 16, 2025
f14b140
Apply suggestions from code review
maliasadi Oct 16, 2025
03c5ef4
Merge branch 'master' into tune_lq_kernel_perf
maliasadi Oct 16, 2025
6b71d5c
Auto update version from '0.44.0-dev5' to '0.44.0-dev6'
ringo-but-quantum Oct 16, 2025
b916e50
Merge with master
maliasadi Oct 24, 2025
cf3a4d5
Update the scope
maliasadi Oct 24, 2025
c90f59e
trigger ci
maliasadi Oct 24, 2025
4285358
Merge branch 'master' into tune_lq_kernel_perf
maliasadi Nov 18, 2025
d371ffc
Apply suggestions from code review
maliasadi Dec 4, 2025
a2e6d48
Auto update version from '0.44.0-dev14' to '0.44.0-dev17'
ringo-but-quantum Dec 4, 2025
41b0967
Merge with master
maliasadi Jan 5, 2026
63532ac
Enable Kernel OMP on Linux and MacOS wheels
maliasadi Jan 5, 2026
fc1d17d
Update docs
maliasadi Jan 5, 2026
fe71c63
Auto update version from '0.44.0-dev26' to '0.44.0-dev28'
ringo-but-quantum Jan 5, 2026
03cf8e1
git mv kernel_tuning.rst
maliasadi Jan 5, 2026
808620c
trigger ci
maliasadi Jan 5, 2026
99a1036
Update changelog
maliasadi Jan 6, 2026
f97486f
Merge with master
maliasadi Jan 8, 2026
05604c6
Merge branch 'master' into tune_lq_kernel_perf
maliasadi Jan 14, 2026
a3dbc25
Auto update version from '0.45.0-dev3' to '0.45.0-dev4'
ringo-but-quantum Jan 14, 2026
7 changes: 6 additions & 1 deletion .github/CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -4,6 +4,10 @@

<h3>Improvements 🛠</h3>

+- Enabled OpenMP support in `lightning.qubit` across all kernel types (LM, AVX2, and AVX512) for better performance tuning.
+[(#1133)](https://github.com/PennyLaneAI/pennylane-lightning/pull/1133)


<h3>Breaking changes 💔</h3>

<h3>Deprecations 👋</h3>
@@ -14,7 +18,7 @@

<h3>Internal changes ⚙️</h3>

-- Merge v0.43.0 rc branch to master.
+- Merged v0.43.0 rc branch to master.
[(#1282)](https://github.com/PennyLaneAI/pennylane-lightning/pull/1282)

- Removed Catalyst version pin in stable CI tests.
@@ -33,6 +37,7 @@

This release contains contributions from (in alphabetical order):

+Ali Asadi,
Joseph Lee

---
2 changes: 1 addition & 1 deletion .github/workflows/tests_lqcpu_python.yml
@@ -93,7 +93,7 @@ jobs:
- name: Create device wheel ${{ inputs.lightning-version }}
run: |
PL_BACKEND=${{ matrix.pl_backend }} python scripts/configure_pyproject_toml.py
-CMAKE_ARGS="-DENABLE_BLAS=${{ matrix.blas }} -DLQ_ENABLE_KERNEL_OMP=ON -DENABLE_PYTHON=ON -DLIGHTNING_CATALYST_SRC_PATH=${{ github.workspace }}/catalyst" python -m build
+CMAKE_ARGS="-DENABLE_BLAS=${{ matrix.blas }} -DENABLE_PYTHON=ON -DLIGHTNING_CATALYST_SRC_PATH=${{ github.workspace }}/catalyst" python -m build
cd dist
WHEEL_NAME=$(ls *.whl)
cp $WHEEL_NAME ${{ github.workspace }}/wheel_${{ matrix.pl_backend }}-${{ matrix.blas }}.whl
17 changes: 13 additions & 4 deletions doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
@@ -1,13 +1,22 @@
Kernel performance tuning
#########################

-Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads. To enable this, we add OpenMP threading within the adjoint differentiation method implementation and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload.
+Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads.
+To enable this, we add OpenMP threading within the adjoint differentiation method implementation
+and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload.

-However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead. For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels.
+However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead.
+For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels.

OpenMP threaded kernels
-----------------------

-To enable OpenMP acceleration of the gate kernels, Lightning-Qubit can be compiled with the ``-DLQ_ENABLE_KERNEL_OMP=ON`` CMake flag. Not, that for gradient workloads with many observables, this may reduce performance in comparison with the default mode, so this behaviour is opt-in only.
+OpenMP acceleration of the gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default in Lightning-Qubit.
+You can control the number of threads used by setting the ``OMP_NUM_THREADS`` environment variable before starting your Python session,
+or, if a session is already running, before simulating your PennyLane programs.
+For gradient workloads with many observables, threaded gate kernels may reduce performance compared with non-threaded kernels.
+To turn this off, use the CMake flag ``-DLQ_ENABLE_KERNEL_OMP=OFF`` when building Lightning-Qubit.

-For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag. This forces the data to avoid updating the CPU cache and can improve performance for larger workloads.
+For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck,
+and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag.
+This forces the data to avoid updating the CPU cache and can improve performance for larger workloads.
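
The updated documentation says the thread count can be set through ``OMP_NUM_THREADS`` before the Python session starts. A minimal sketch of that option, using only the standard library, is shown here. The helper name `run_with_omp_threads` and the stand-in child script are hypothetical and not part of this PR; in real use the child script would import PennyLane and create a `lightning.qubit` device instead of printing the variable.

```python
import os
import subprocess
import sys


def run_with_omp_threads(code: str, num_threads: int) -> str:
    """Run `code` in a fresh Python process with OMP_NUM_THREADS set.

    The variable is placed in the child's environment before the interpreter
    starts, so any OpenMP runtime loaded by that process can read it.
    """
    env = dict(os.environ)  # copy, so the parent environment is untouched
    env["OMP_NUM_THREADS"] = str(num_threads)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Stand-in child script (assumption): it only echoes the variable. A real
# script would import pennylane and run a circuit on "lightning.qubit".
CHILD_SCRIPT = "import os; print(os.environ['OMP_NUM_THREADS'])"

if __name__ == "__main__":
    for threads in (1, 4):
        seen = run_with_omp_threads(CHILD_SCRIPT, threads)
        print(f"requested {threads}, child saw OMP_NUM_THREADS={seen}")
```

Launching a fresh process per thread count keeps each measurement independent of whatever an earlier run left in the environment, which may be convenient when comparing timings across thread counts.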
9 changes: 9 additions & 0 deletions doc/lightning_qubit/device.rst
@@ -141,6 +141,15 @@ If you are computing a large number of expectation values, or if you are using a
dev = qml.device("lightning.qubit", wires=2, batch_obs=True)


+**OpenMP acceleration of the gate kernels:**
+
+OpenMP acceleration of the gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default in Lightning-Qubit.
+You can control the number of threads used by setting the ``OMP_NUM_THREADS`` environment variable before starting your Python session,
+or, if a session is already running, before simulating your PennyLane programs.
+
+For gradient workloads with many observables, threaded gate kernels may reduce performance compared with non-threaded kernels.
+To turn this off, use the CMake flag ``-DLQ_ENABLE_KERNEL_OMP=OFF`` when building Lightning-Qubit.
+
**Markov Chain Monte Carlo sampling support:**

The ``lightning.qubit`` device allows users to use the Markov Chain Monte Carlo (MCMC) sampling method to generate approximate samples. To enable the MCMC sampling method for sample generation, initialize a ``lightning.qubit`` device with the ``mcmc=True`` keyword argument, as:
2 changes: 1 addition & 1 deletion pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
Version number (major.minor.patch[-label])
"""

-__version__ = "0.44.0-dev4"
+__version__ = "0.44.0-dev5"
@@ -20,7 +20,7 @@ add_library(lightning_qubit STATIC ${LQUBIT_FILES})

option(ENABLE_BLAS "Enable BLAS" OFF)
option(ENABLE_GATE_DISPATCHER "Enable gate kernel dispatching on AVX/AVX2/AVX512" ON)
-option(LQ_ENABLE_KERNEL_OMP "Enable OpenMP pragmas for gate kernels" OFF)
+option(LQ_ENABLE_KERNEL_OMP "Enable OpenMP pragmas for gate kernels" ON)
option(LQ_ENABLE_KERNEL_AVX_STREAMING "Enable AVX2/512 streaming operations for gate kernels" OFF)

# Inform the compiler that this device is enabled.
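
With the option now defaulting to `ON`, opting out requires passing `-DLQ_ENABLE_KERNEL_OMP=OFF` at build time. The CI workflow in this PR passes CMake flags through the `CMAKE_ARGS` environment variable before `python -m build`. A small sketch of composing that variable is shown here; the helper names `make_cmake_args` and `build_env` are hypothetical, and only flag names that appear in this diff are used.

```python
import os


def make_cmake_args(flags: dict) -> str:
    """Render a mapping of CMake cache entries as a CMAKE_ARGS string.

    Booleans become ON/OFF; other values are inserted as given.
    """
    parts = []
    for name, value in flags.items():
        if isinstance(value, bool):
            value = "ON" if value else "OFF"
        parts.append(f"-D{name}={value}")
    return " ".join(parts)


def build_env(flags: dict, base_env=None) -> dict:
    """Return a copy of the environment with CMAKE_ARGS set from `flags`."""
    env = dict(os.environ if base_env is None else base_env)
    env["CMAKE_ARGS"] = make_cmake_args(flags)
    return env


if __name__ == "__main__":
    # Opt out of the threaded gate kernels that this PR enables by default.
    env = build_env({"ENABLE_PYTHON": True, "LQ_ENABLE_KERNEL_OMP": False})
    print(env["CMAKE_ARGS"])
```

The resulting environment could then be handed to something like `subprocess.run([sys.executable, "-m", "build"], env=env)`, mirroring the workflow step; that call is left out here so the sketch runs without a source checkout.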