PennyLaneAI · maliasadi · Jan 14, 2026 · Apr 17, 2025 · Apr 17, 2025 · May 28, 2025
diff --git a/.github/CHANGELOG.md b/.github/CHANGELOG.md
@@ -4,6 +4,10 @@
 
 <h3>Improvements 🛠</h3>
 
+- Enabled OpenMP support in `lightning.qubit` across all kernel types (LM, AVX2, and AVX512) for better performance tuning.
+  [(#1133)](https://github.com/PennyLaneAI/pennylane-lightning/pull/1133)
+
+
 <h3>Breaking changes 💔</h3>
 
 <h3>Deprecations 👋</h3>
@@ -14,7 +18,7 @@
 
 <h3>Internal changes ⚙️</h3>
 
-- Merge v0.43.0 rc branch to master.
+- Merged v0.43.0 rc branch to master.
   [(#1282)](https://github.com/PennyLaneAI/pennylane-lightning/pull/1282)
 
 - Removed Catalyst version pin in stable CI tests.
@@ -33,6 +37,7 @@
 
 This release contains contributions from (in alphabetical order):
 
+Ali Asadi,
 Joseph Lee
 
 ---

diff --git a/.github/workflows/tests_lqcpu_python.yml b/.github/workflows/tests_lqcpu_python.yml
@@ -93,7 +93,7 @@ jobs:
       - name: Create device wheel ${{ inputs.lightning-version }}
         run: |
           PL_BACKEND=${{ matrix.pl_backend }} python scripts/configure_pyproject_toml.py
-          CMAKE_ARGS="-DENABLE_BLAS=${{ matrix.blas }} -DLQ_ENABLE_KERNEL_OMP=ON -DENABLE_PYTHON=ON -DLIGHTNING_CATALYST_SRC_PATH=${{ github.workspace }}/catalyst" python -m build
+          CMAKE_ARGS="-DENABLE_BLAS=${{ matrix.blas }} -DENABLE_PYTHON=ON -DLIGHTNING_CATALYST_SRC_PATH=${{ github.workspace }}/catalyst" python -m build
           cd dist
           WHEEL_NAME=$(ls *.whl)
           cp $WHEEL_NAME ${{ github.workspace }}/wheel_${{ matrix.pl_backend }}-${{ matrix.blas }}.whl

diff --git a/doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst b/doc/lightning_qubit/development/avx_kernels/kernel_tuning.rst
@@ -1,13 +1,22 @@
 Kernel performance tuning
 #########################
 
-Lightning-Qubit's kernel implementations are by default tuned for high throughput single-threaded performance with gradient workloads. To enable this, we add OpenMP threading within the adjoint differentiation method implementation and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload.
+Lightning-Qubit's kernel implementations are tuned by default for high-throughput single-threaded performance with gradient workloads.
+To enable this, we add OpenMP threading within the adjoint differentiation method implementation
+and use SIMD-level intrinsics to ensure fast performance for each given circuit in such a workload.
 
-However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead. For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels.
+However, sometimes we may want to modify the above defaults to favour a given workload, such as by enabling multi-threaded execution of the gate kernels instead.
+For this, we have several compile-time flags to change the operating behaviour of Lightning-Qubit kernels.
 
 OpenMP threaded kernels
 -----------------------
 
-To enable OpenMP acceleration of the gate kernels, Lightning-Qubit can be compiled with the ``-DLQ_ENABLE_KERNEL_OMP=ON`` CMake flag. Not, that for gradient workloads with many observables, this may reduce performance in comparison with the default mode, so this behaviour is opt-in only.
+OpenMP acceleration of the gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default in Lightning-Qubit.
+You can control the number of threads by setting the ``OMP_NUM_THREADS`` environment variable before starting your Python session
+or, if a session is already running, before simulating your PennyLane programs.
+For gradient workloads with many observables, threaded gate kernels may reduce performance compared with single-threaded kernels.
+To turn them off, build Lightning-Qubit with the CMake flag ``-DLQ_ENABLE_KERNEL_OMP=OFF``.
 
-For workloads that show benefit from the use of threaded gate kernels, sometimes updating the CPU cache to accommodate recently modified data can become a bottleneck, and saturates the performance gained at high thread counts. This may be alleviated somewhat on systems supporting AVX2 and AVX-512 operations using the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag. This forces the data to avoid updating the CPU cache and can improve performance for larger workloads.
+For workloads that benefit from threaded gate kernels, updating the CPU cache to accommodate recently modified data can sometimes become a bottleneck
+and saturate the performance gained at high thread counts. On systems supporting AVX2 and AVX-512 operations, this may be alleviated somewhat with the ``-DLQ_ENABLE_KERNEL_AVX_STREAMING=on`` CMake flag.
+This forces the data to bypass the CPU cache on writes and can improve performance for larger workloads.
diff --git a/doc/lightning_qubit/device.rst b/doc/lightning_qubit/device.rst
@@ -141,6 +141,15 @@ If you are computing a large number of expectation values, or if you are using a
     dev = qml.device("lightning.qubit", wires=2, batch_obs=True)
 
 
+**OpenMP acceleration of the gate kernels:**
+
+OpenMP acceleration of the gate kernels across all kernel types (LM, AVX2, and AVX512) is enabled by default in Lightning-Qubit.
+You can control the number of threads by setting the ``OMP_NUM_THREADS`` environment variable before starting your Python session
+or, if a session is already running, before simulating your PennyLane programs.
+
+For gradient workloads with many observables, threaded gate kernels may reduce performance compared with single-threaded kernels.
+To turn them off, build Lightning-Qubit with the CMake flag ``-DLQ_ENABLE_KERNEL_OMP=OFF``.
+
 **Markov Chain Monte Carlo sampling support:**
 
 The ``lightning.qubit`` device allows users to use the Markov Chain Monte Carlo (MCMC) sampling method to generate approximate samples. To enable the MCMC sampling method for sample generation, initialize a ``lightning.qubit`` device with the ``mcmc=True`` keyword argument, as:

diff --git a/pennylane_lightning/core/_version.py b/pennylane_lightning/core/_version.py
@@ -16,4 +16,4 @@
 Version number (major.minor.patch[-label])
 """
 
-__version__ = "0.44.0-dev4"
+__version__ = "0.44.0-dev5"
diff --git a/pennylane_lightning/core/simulators/lightning_qubit/CMakeLists.txt b/pennylane_lightning/core/simulators/lightning_qubit/CMakeLists.txt
@@ -20,7 +20,7 @@ add_library(lightning_qubit STATIC ${LQUBIT_FILES})
 
 option(ENABLE_BLAS "Enable BLAS" OFF)
 option(ENABLE_GATE_DISPATCHER "Enable gate kernel dispatching on AVX/AVX2/AVX512" ON)
-option(LQ_ENABLE_KERNEL_OMP "Enable OpenMP pragmas for gate kernels" OFF)
+option(LQ_ENABLE_KERNEL_OMP "Enable OpenMP pragmas for gate kernels" ON)
 option(LQ_ENABLE_KERNEL_AVX_STREAMING "Enable AVX2/512 streaming operations for gate kernels" OFF)
 
 # Inform the compiler that this device is enabled.
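
The doc changes in this diff say the gate-kernel thread count is controlled through ``OMP_NUM_THREADS``. A minimal sketch of that usage follows, assuming the variable is set before PennyLane is imported, which is the conservative ordering. The helper name `configure_omp_threads`, the `demo` function, and the thread count of 4 are hypothetical and chosen only for illustration; they are not part of this PR.

```python
import os


def configure_omp_threads(num_threads: int) -> str:
    """Set OMP_NUM_THREADS for the current process and return the stored value.

    Hypothetical helper for illustration; not part of Lightning-Qubit.
    """
    if num_threads < 1:
        raise ValueError("num_threads must be a positive integer")
    os.environ["OMP_NUM_THREADS"] = str(num_threads)
    return os.environ["OMP_NUM_THREADS"]


def demo():
    """Run a small circuit on lightning.qubit, if PennyLane is installed."""
    # Set the thread count first, before PennyLane (and the compiled
    # Lightning-Qubit library) is imported.
    configure_omp_threads(4)

    try:
        import pennylane as qml
    except ImportError:
        return None  # PennyLane is not installed; only the variable was set.

    dev = qml.device("lightning.qubit", wires=2)

    @qml.qnode(dev)
    def circuit(theta):
        qml.RX(theta, wires=0)
        qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(1))

    return circuit(0.5)


if __name__ == "__main__":
    print(demo())
```

Setting the variable in the shell before launching Python has the same effect. Whether a given thread count helps depends on the workload, so it is worth timing a representative circuit at a few settings.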