diff --git a/content/en/_index.md b/content/en/_index.md
index 1188d59..f52f310 100644
--- a/content/en/_index.md
+++ b/content/en/_index.md
@@ -17,15 +17,12 @@ title: Portable Data-Parallel Python Extensions with oneAPI
diff --git a/content/en/docs/first-app.md b/content/en/docs/first-app.md
index c28b5da..16a5799 100755
--- a/content/en/docs/first-app.md
+++ b/content/en/docs/first-app.md
@@ -1,15 +1,102 @@
---
title: First DPC++ app
description: A SYCL and DPC++ "Hello, World!" example.
-date: 2017-01-05
+date: 2024-07-02
weight: 2
---
-{{% pageinfo %}}
+For an in-depth introduction to SYCL and to accelerator programming, please refer to the "[Data Parallel C++](https://link.springer.com/book/10.1007/978-1-4842-9691-2)" open-access e-book.
-This is a placeholder page that shows you how to use this template site.
+A SYCL application runs on a SYCL platform: a host connected to one or more heterogeneous devices. The application is structured in three scopes: application scope, command group scope, and kernel scope. The kernel scope specifies a single kernel function that will be compiled by the device
+compiler and executed on the device. The command group scope specifies a unit of work which includes the kernel function, the preparation of
+its arguments, and execution-ordering information. The application scope encompasses all the other code outside of the command group scope.
+Execution of a SYCL application begins in the application scope.
-{{% /pageinfo %}}
+```cpp
+// Compile: icpx -fsycl first.cpp -o first
+#include <sycl/sycl.hpp>
+#include <iostream>
-Do you have any example **applications** or **code** for your users in your repo
-or elsewhere? Link to your examples here.
+int main(void) {
+    // queue to enqueue work to the
+    // default-selected device
+    sycl::queue q{sycl::default_selector_v};
+
+    // device allocation
+    size_t data_size = 256;
+    int *data = sycl::malloc_device<int>(data_size, q);
+
+    // submit a task to populate
+    // the device allocation
+    sycl::event e_fill =
+        q.fill(data, 42, data_size); // built-in kernel
+
+    // submit a kernel to modify the device allocation
+    sycl::event e_comp =
+        q.submit([&](sycl::handler &cgh) { // command-group scope
+            // order execution after the
+            // fill task completes
+            cgh.depends_on(e_fill);
+
+            sycl::range<1> global_iter_range{data_size};
+            cgh.parallel_for(
+                global_iter_range,
+                [=](sycl::item<1> it) { // kernel scope
+                    int i = it.get_id(0);
+                    data[i] += i;
+                }
+            );
+        });
+
+    // copy from device to host,
+    // ordered after the modification task completes
+    int *host_data = new int[data_size];
+
+    q.copy( // built-in kernel
+        data, host_data, data_size, {e_comp}
+    ).wait();
+    sycl::free(data, q);
+
+    // output the content of the array
+    for (size_t i = 0; i < data_size; ++i) {
+        std::cout << host_data[i] << " ";
+    }
+    std::cout << std::endl;
+    delete[] host_data;
+
+    return 0;
+}
+```
+
+The device on which the kernel functions are executed is controlled by a device selector function, here ``sycl::default_selector_v``.
+The default selector assigns a score to every device recognized by the runtime and selects the one with the highest score.
+A list of devices recognized by the DPC++ runtime can be obtained by running the ``sycl-ls`` command.
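+
+To confirm at run time which device the selector picked, the queue can be queried for its device's name; a minimal sketch using only standard SYCL 2020 queries:
+
+```cpp
+#include <sycl/sycl.hpp>
+#include <iostream>
+
+int main(void) {
+    sycl::queue q{sycl::default_selector_v};
+
+    // print the human-readable name of the selected device
+    std::cout << "Running on: "
+              << q.get_device().get_info<sycl::info::device::name>()
+              << std::endl;
+
+    return 0;
+}
+```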
+
+A user of a SYCL application compiled with DPC++ may restrict the set of devices discoverable by the runtime using
+the ``ONEAPI_DEVICE_SELECTOR`` environment variable. For example:
+
+```bash
+# execute on a GPU device
+ONEAPI_DEVICE_SELECTOR="*:gpu" ./first
+# execute on a CPU device
+ONEAPI_DEVICE_SELECTOR="*:cpu" ./first
+```
+
+By default, the DPC++ compiler generates offload code for the [SPIR64](https://www.khronos.org/spir/) SYCL target, which is supported by
+Intel GPUs as well as by CPU devices of x86_64 architecture. An attempt to execute a SYCL program while
+selecting only devices that do not support the SPIR language results in an error.
+
+### Targeting other GPUs
+
+DPC++ supports generating offload sections for multiple targets. For example, to compile for both the SPIR and NVPTX targets (oneAPI for NVidia(R) GPUs is assumed to be installed):
+
+```bash
+icpx -fsycl -fsycl-targets="nvptx64-nvidia-cuda,spir64-unknown-unknown" first.cpp -o first.out
+```
+
+To compile for both the SPIR and AMD GCN targets (oneAPI for AMD GPUs is assumed to be installed):
+
+```bash
+icpx -fsycl -fsycl-targets="amdgcn-amd-amdhsa,spir64-unknown-unknown" first.cpp -o first.out
+```
+
+It is possible to pass additional arguments to a specific SYCL target backend. For example, to target a specific architecture use:
+
+- ``-Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfx1030`` for AMD GPUs
+- ``-Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80`` for NVidia GPUs
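+
+Combining the two, a sketch of a compile line that pins the NVPTX target to a specific architecture (reusing the flags above; adjust ``sm_80`` to match your hardware):
+
+```bash
+icpx -fsycl -fsycl-targets="nvptx64-nvidia-cuda,spir64-unknown-unknown" \
+  -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_80 \
+  first.cpp -o first.out
+```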
diff --git a/content/en/docs/kde-cpp.md b/content/en/docs/kde-cpp.md
index 6251de9..68c881b 100644
--- a/content/en/docs/kde-cpp.md
+++ b/content/en/docs/kde-cpp.md
@@ -1,81 +1,128 @@
---
title: KDE DPC++ example
description: KDE (kernel density estimation) example using SYCL and DPC++.
+date: 2024-07-02
weight: 2
---
-{{% pageinfo %}}
-
-These basic sample guidelines assume that your Docsy site is deployed using Netlify and your files are stored in GitHub. You can use the guidelines "as is" or adapt them with your own instructions: for example, other deployment options, information about your doc project's file structure, project-specific review guidelines, versioning guidelines, or any other information your users might find useful when updating your site. [Kubeflow](https://github.com/kubeflow/website/blob/master/README.md) has a great example.
-
-Don't forget to link to your own doc repo rather than our example site! Also make sure users can find these guidelines from your doc repo README: either add them there and link to them from this page, add them here and link to them from the README, or include them in both locations.
-
-{{% /pageinfo %}}
-
-We use [Hugo](https://gohugo.io/) to format and generate our website, the
-[Docsy](https://github.com/google/docsy) theme for styling and site structure,
-and [Netlify](https://www.netlify.com/) to manage the deployment of the site.
-Hugo is an open-source static site generator that provides us with templates,
-content organisation in a standard directory structure, and a website generation
-engine. You write the pages in Markdown (or HTML if you want), and Hugo wraps them up into a website.
-
-All submissions, including submissions by project members, require review. We
-use GitHub pull requests for this purpose. Consult
-[GitHub Help](https://help.github.com/articles/about-pull-requests/) for more
-information on using pull requests.
-
-## Quick start with Netlify
-
-Here's a quick guide to updating the docs. It assumes you're familiar with the
-GitHub workflow and you're happy to use the automated preview of your doc
-updates:
-
-1. Fork the [Goldydocs repo](https://github.com/google/docsy-example) on GitHub.
-1. Make your changes and send a pull request (PR).
-1. If you're not yet ready for a review, add "WIP" to the PR name to indicate
- it's a work in progress. (**Don't** add the Hugo property
- "draft = true" to the page front matter, because that prevents the
- auto-deployment of the content preview described in the next point.)
-1. Wait for the automated PR workflow to do some checks. When it's ready,
- you should see a comment like this: **deploy/netlify — Deploy preview ready!**
-1. Click **Details** to the right of "Deploy preview ready" to see a preview
- of your updates.
-1. Continue updating your doc and pushing your changes until you're happy with
- the content.
-1. When you're ready for a review, add a comment to the PR, and remove any
- "WIP" markers.
-
-## Updating a single page
-
-If you've just spotted something you'd like to change while using the docs, Docsy has a shortcut for you:
-
-1. Click **Edit this page** in the top right hand corner of the page.
-1. If you don't already have an up to date fork of the project repo, you are prompted to get one - click **Fork this repository and propose changes** or **Update your Fork** to get an up to date version of the project to edit. The appropriate page in your fork is displayed in edit mode.
-1. Follow the rest of the [Quick start with Netlify](#quick-start-with-netlify) process above to make, preview, and propose your changes.
-
-## Previewing your changes locally
-
-If you want to run your own local Hugo server to preview your changes as you work:
-
-1. Follow the instructions in [Getting started](/docs/getting-started) to install Hugo and any other tools you need. You'll need at least **Hugo version 0.45** (we recommend using the most recent available version), and it must be the **extended** version, which supports SCSS.
-1. Fork the [Goldydocs repo](https://github.com/google/docsy-example) repo into your own project, then create a local copy using `git clone`. Don’t forget to use `--recurse-submodules` or you won’t pull down some of the code you need to generate a working site.
-
- ```
- git clone --recurse-submodules --depth 1 https://github.com/google/docsy-example.git
- ```
-
-1. Run `hugo server` in the site root directory. By default your site will be available at http://localhost:1313/. Now that you're serving your site locally, Hugo will watch for changes to the content and automatically refresh your site.
-1. Continue with the usual GitHub workflow to edit files, commit them, push the
- changes up to your fork, and create a pull request.
-
-## Creating an issue
-
-If you've found a problem in the docs, but you're not sure how to fix it yourself, please create an issue in the [Goldydocs repo](https://github.com/google/docsy-example/issues). You can also create an issue about a specific page by clicking the **Create Issue** button in the top right hand corner of the page.
-
-## Useful resources
-
-* [Docsy user guide](https://www.docsy.dev/docs/): All about Docsy, including how it manages navigation, look and feel, and multi-language support.
-* [Hugo documentation](https://gohugo.io/documentation/): Comprehensive reference for Hugo.
-* [Github Hello World!](https://guides.github.com/activities/hello-world/): A basic introduction to GitHub concepts and workflow.
-
-
+Given a sample of \\(n\\) observations \\(x_i\\) drawn from an unknown underlying continuous distribution \\(f(x)\\),
+the kernel density estimate of that density function is computed as follows, for some kernel
+smoothing parameter \\(h \in \mathbb{R}\\):
+
+$$
+ \hat{f}(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - x_i}{h}\right)
+$$
+
+An example of NumPy code performing the estimation, with the kernel chosen to be the standard
+\\(d\\)-dimensional Gaussian density:
+
+```python
+import numpy as np
+
+def kde(poi: np.ndarray, sample: np.ndarray, h: float) -> np.ndarray:
+    """Given a sample from an underlying continuous distribution and
+    a smoothing parameter `h`, evaluate the density estimate at each
+    point of interest `poi`.
+    """
+    assert sample.ndim == 2
+    assert poi.ndim == 2
+    m, d1 = poi.shape
+    n, d2 = sample.shape
+    assert d1 == d2
+    assert h > 0
+    dm = np.sum(np.square(poi[:, np.newaxis, ...] - sample[np.newaxis, ...]), axis=-1)
+    return np.mean(np.exp(dm / (-2 * h * h)), axis=-1) / np.power(np.sqrt(2 * np.pi) * h, d1)
+```
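+
+A quick usage sketch (the sample size, bandwidth, and seed are illustrative):
+
+```python
+rng = np.random.default_rng(1234)
+sample = rng.standard_normal(size=(1000, 2))  # 1000 draws from a 2-d standard normal
+poi = np.zeros((1, 2))                        # evaluate the estimate at the origin
+print(kde(poi, sample, h=0.2))                # close to the true density 1/(2*pi) ~= 0.159
+```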
+
+The code above evaluates \\(\hat{f}\\) at \\(m\\) points of interest \\(y_t\\), with \\(K(u) = (2\pi)^{-d/2} e^{-u/2}\\):
+
+$$
+  \hat{f}(y_t) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d} K\left( \frac{\left\lVert y_t - x_i \right\rVert^{2}}{h^2} \right), \;\;\; \forall \, 0 \leq t < m
+$$
+
+Evaluating such an expression can be done in parallel: the evaluation is independent for each \\(t\\).
+Furthermore, the summation over \\(i\\) can be partitioned among work-items, each summing \\(n_{wi}\\) distinct terms.
+Such work partitioning generates \\(m \cdot \left\lceil {n}/{n_{wi}}\right\rceil\\) independent tasks.
+Each work-item writes its partial sum into a dedicated temporary memory location to avoid race conditions;
+the partial sums are then combined by another kernel operating in a similar fashion.
+
+```cpp
+    parallel_for(
+        sycl::range<2>(m, ((n + n_wi - 1) / n_wi)),
+        [=](sycl::item<2> it) {
+            auto t = it.get_id(0);       // index of the point of interest
+            auto i_block = it.get_id(1); // index of the block of n_wi terms
+
+            T local_partial_sum = ...;   // sum of the n_wi terms of this block
+
+            partial_sums[t * ((n + n_wi - 1) / n_wi) + i_block] = local_partial_sum;
+        }
+    );
+```
+
+Such an approach, known as tree reduction, is implemented in the ``kernel_density_estimation_temps`` function found in
+``"steps/kernel_density_estimation_cpp/kde.hpp"``.
+
+The use of a temporary allocation can be avoided if each work-item atomically adds its local sum to the
+appropriate zero-initialized location in the output array, as in the implementation ``kernel_density_estimation_atomic_ref``
+in the same header file:
+
+```cpp
+    parallel_for(
+        sycl::range<2>(m, ((n + n_wi - 1) / n_wi)),
+        [=](sycl::item<2> it) {
+            auto t = it.get_id(0);
+            auto i_block = it.get_id(1);
+
+            T local_partial_sum = ...;
+
+            // atomically accumulate the partial sum into the output array
+            sycl::atomic_ref<...> f_aref(f[t]);
+            f_aref += local_partial_sum;
+        }
+    );
+```
+
+Multiple work-items concurrently updating the same location in global memory still produce the correct result thanks to the
+use of ``sycl::atomic_ref``, but at the expense of an increased number of update attempts, a phenomenon known as atomic pressure.
+Atomic pressure leads to thread divergence and degrades performance.
+
+To reduce atomic pressure, work-items can be organized into work-groups. Every work-item in a work-group has access
+to local shared memory, a dedicated on-chip memory, which can be used to cooperatively combine values held by work-items
+in the work-group without accessing global memory. This can be done efficiently by calling the group function
+``sycl::reduce_over_group``. To be able to call it, we must specify the iteration range using ``sycl::nd_range`` rather than
+``sycl::range`` as we did earlier.
+
+```cpp
+    auto wg = 256; // work-group size
+    auto n_data_per_wg = n_wi * wg;
+    auto n_groups = ((n + n_data_per_wg - 1) / n_data_per_wg);
+
+    sycl::range<2> gRange(m, n_groups * wg);
+    sycl::range<2> lRange(1, wg);
+
+    parallel_for(
+        sycl::nd_range<2>(gRange, lRange),
+        [=](sycl::nd_item<2> it) {
+            auto t = it.get_global_id(0);
+
+            T local_partial_sum = ...;
+
+            // cooperatively sum the partial sums across the work-group
+            auto work_group = it.get_group();
+            T sum_over_wg = sycl::reduce_over_group(work_group, local_partial_sum, sycl::plus<>());
+
+            // one atomic update per work-group instead of one per work-item
+            if (work_group.leader()) {
+                sycl::atomic_ref<...> f_aref(f[t]);
+                f_aref += sum_over_wg;
+            }
+        }
+    );
+```
+
+The complete implementation can be found in the ``kernel_density_estimation_work_group_reduce_and_atomic_ref`` function
+in ``"steps/kernel_density_estimation_cpp/kde.hpp"``.
+
+These implementations are called from the C++ application ``"steps/kernel_density_estimation_cpp/app.cpp"``, which
+samples data uniformly distributed over the unit cuboid and estimates the density using kernel density estimation
+with a spherically symmetric multivariate Gaussian probability density function as the kernel.
+
+The application can be built using `CMake` or `Meson`; please refer to the [README](steps/kernel_density_estimation_cpp/README.md) document in that folder.
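+
+A typical configure-and-build sketch (the compiler selection shown is an assumption of this sketch; the authoritative options are in the README):
+
+```bash
+cmake -S . -B build -DCMAKE_CXX_COMPILER=icpx
+cmake --build build --parallel
+```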
diff --git a/content/en/docs/kde-python.md b/content/en/docs/kde-python.md
index fc39378..405bc84 100644
--- a/content/en/docs/kde-python.md
+++ b/content/en/docs/kde-python.md
@@ -1,15 +1,116 @@
---
title: KDE Python extension
description: KDE (kernel density estimation) Python extension example.
+date: 2024-07-02
weight: 3
---
-{{% pageinfo %}}
+Since SYCL builds on C++, we are going to use the `pybind11` project to generate the Python extension.
+We also need Python objects to carry USM allocations of input and output data, such as those provided by the `dpctl` ([Data Parallel Control](https://github.com/IntelPython/dpctl.git)) Python package. The `dpctl` package also provides Python objects corresponding to DPC++ runtime objects:
-This is a placeholder page that shows you how to use this template site.
+| Python object | SYCL C++ object |
+| --------------------- | ----------------- |
+| ``dpctl.SyclQueue`` | ``sycl::queue`` |
+| ``dpctl.SyclDevice`` | ``sycl::device`` |
+| ``dpctl.SyclContext`` | ``sycl::context`` |
+| ``dpctl.SyclEvent`` | ``sycl::event`` |
-{{% /pageinfo %}}
+`dpctl` provides integration with `pybind11` supporting casts between `dpctl` Python objects and the corresponding C++ SYCL classes listed in the table above. Furthermore, the integration provides the C++ class ``dpctl::tensor::usm_ndarray``, which derives from ``pybind11::object``.
+It stores a `dpctl.tensor.usm_ndarray` object and provides methods to query its attributes, such as the data pointer, dimensionality, shape, strides,
+and elemental type information.
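+
+A hedged sketch of such queries for a ``dpctl::tensor::usm_ndarray`` object ``arr`` (method names as provided by ``dpctl4pybind11.hpp``; consult the `dpctl` documentation for the authoritative API):
+
+```cpp
+int ndim = arr.get_ndim();                 // dimensionality
+pybind11::ssize_t n0 = arr.get_shape(0);   // extent along axis 0
+int typenum = arr.get_typenum();           // elemental type identifier
+char *untyped_ptr = arr.get_data();        // raw pointer to the USM data
+sycl::queue q = arr.get_queue();           // queue the allocation is associated with
+```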
-For many projects, users may not need much information beyond the information in the [Overview](/docs/overview/), so this section is **optional**. However if there are areas where your users will need a more detailed understanding of a given term or feature in order to do anything useful with your project (or to not make mistakes when using it) put that information in this section. For example, you may want to add some conceptual pages if you have a large project with many components and a complex architecture.
+For illustration purposes, here is sample extension source code:
-Remember to focus on what the user needs to know, not just what you think is interesting about your project! If they don’t need to understand your original design decisions to use or contribute to the project, don’t put them in, or include your design docs in your repo and link to them. Similarly, most users will probably need to know more about how features work when in use rather than how they are implemented. Consider a separate architecture page for more detailed implementation and system design information that potential project contributors can consult.
+```cpp
+#include <cstdint>
+#include <vector>
+#include <sycl/sycl.hpp>
+#include "dpctl4pybind11.hpp"
+#include <pybind11/pybind11.h>
+
+sycl::event
+py_foo(dpctl::tensor::usm_ndarray inp, dpctl::tensor::usm_ndarray out,
+       const std::vector<sycl::event> &depends)
+{
+    // validation steps skipped
+
+    // The execution queue is the queue associated with the input arrays;
+    // these are expected to be the same and checked during validation
+    sycl::queue exec_q = inp.get_queue();
+
+    const std::int64_t *inp_ptr = inp.get_data<std::int64_t>();
+    std::int64_t *out_ptr = out.get_data<std::int64_t>();
+
+    // submit tasks for execution and obtain the event signaling
+    // the status of execution of the set of tasks
+    // (impl_fn is the kernel-submitting implementation, defined elsewhere)
+    sycl::event e_impl = impl_fn(exec_q, inp_ptr, out_ptr, depends);
+
+    return e_impl;
+}
+
+PYBIND11_MODULE(_ext, m) {
+    m.def("foo", &py_foo);
+}
+```
+
+On the Python side, the function would be called as follows:
+
+```python
+import dpctl.tensor as dpt
+import _ext
+
+# Allocate input and output arrays on the
+# default-selected device
+inp = dpt.arange(100, dtype=dpt.int64)
+out = dpt.empty_like(inp)
+ev = _ext.foo(inp, out, [])
+
+# ...
+```
+
+Since execution is offloaded to a device, it is our responsibility to ensure that the USM data being worked on
+is not deallocated until after the offloaded tasks complete execution. The simplest way to ensure this is
+to wait on the event with `ev.wait()` on the Python side, or with `e_impl.wait()` on the C++ side.
+
+Alternatively, one can extend the lifetime of the arrays asynchronously, by using `sycl::handler::host_task` to
+schedule code execution on the host, ordered after the kernel execution completes:
+
+```cpp
+// increment reference counts of the input and output arrays
+
+sycl::event ht_ev =
+    exec_q.submit([&](sycl::handler &cgh) {
+        // execute the host_task once e_impl signals completion
+        cgh.depends_on(e_impl);
+
+        cgh.host_task([=]() {
+            // we must acquire the GIL to be able to safely
+            // manipulate reference counts of Python objects
+            pybind11::gil_scoped_acquire guard;
+
+            // decrement the reference counts
+        });
+    });
+```
+
+Since the host task may execute in a thread different from that of the Python interpreter (the main thread), care must be taken
+to avoid deadlocks: synchronization operations that call `ht_ev.wait()` from the main thread must release the GIL to afford the body
+of the host task a chance to execute.
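+
+A hedged sketch of such a synchronization entry point (the binding name ``foo_wait`` is illustrative; the ``sycl::event`` caster comes from ``dpctl4pybind11.hpp``):
+
+```cpp
+m.def("foo_wait", [](sycl::event ev) {
+    // release the GIL for the duration of the blocking wait so that
+    // the host task scheduled above can acquire it and run
+    pybind11::gil_scoped_release release;
+    ev.wait();
+});
+```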
+
+Of course, if the USM memory is not managed by Python, it may be possible to avoid using the GIL altogether.
+
+An example of a Python extension, `"kde_sycl_ext"`, that exposes the kernel density estimation code from the previous
+section can be found in the `"steps/sycl_python_extension"` folder (see [README](steps/sycl_python_extension/README.md)).
+
+The folder contains a comparison between a `dpctl`-based KDE implementation following the NumPy
+implementation [above](#kde_numpy) and the dedicated C++ code:
+
+```
+KDE for n_sample = 1000000, n_est = 17, n_dim = 7, h = 0.05
+Result agreed.
+kde_dpctl took 0.3404452269896865 seconds
+kde_ext[mode=0] 0.02209925901843235 seconds
+kde_ext[mode=1] 0.02560457994695753 seconds
+kde_ext[mode=2] 0.02815118699800223 seconds
+kde_numpy 0.7227164240321144 seconds
+```
+
+This sample run was obtained on a laptop with 11th Gen Intel(R) Core(TM) i7-1185G7 CPU @ 3.00GHz, 32 GB of RAM, and the integrated Intel(R) Iris(R) Xe GPU, with stock NumPy 1.26.4, and development build of dpctl 0.17 built with oneAPI DPC++ 2024.1.0.
diff --git a/content/en/docs/oneAPI.md b/content/en/docs/oneAPI.md
index be15072..6dd56c5 100644
--- a/content/en/docs/oneAPI.md
+++ b/content/en/docs/oneAPI.md
@@ -1,43 +1,34 @@
---
title: About oneAPI
linkTitle: About oneAPI
-description: A brief overview of oneAPI and the programming model.
+description: A brief overview of oneAPI and the programming model
+date: 2024-07-02
weight: 1
---
-The Overview is where your users find out about your project. Depending on the
-size of your docset, you can have a separate overview page (like this one) or
-put your overview contents in the Documentation landing page (like in the Docsy
-User Guide).
+The Unified Acceleration Foundation ([UXL](https://uxlfoundation.org/)), under the umbrella of the Linux Foundation, is driving an open-standard accelerator software ecosystem that includes compilers and performance libraries. This software ecosystem standardizes programming of different types of accelerators, such as multi-core CPUs, GPUs, and some FPGAs, from different vendors.
-Try answering these questions for your user in this page:
+Intel's oneAPI DPC++ compiler is an implementation of the [SYCL-2020 standard](https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html), which is part of the UXL Foundation's overall language-design standardization for accelerator programming in C++. The compiler is developed at https://github.com/intel/llvm and supports offloading to Intel(R) GPUs, NVidia(R) GPUs, and AMD GPUs.
-## What is it?
+The oneAPI DPC++ compiler can be installed on Linux and Windows as part of the [oneAPI Base Toolkit](https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html) and requires a system compiler to be installed, such as GCC on Linux or MSVC on Windows. Since DPC++ builds
+on the conventional toolchain, Python extensions built with DPC++ are compatible with the Python stack out of the box. DPC++ also requires OpenCL and Level Zero drivers; for NVidia(R) or AMD devices, the appropriate vendor drivers must be installed.
-Introduce your project, including what it does or lets you do, why you would use
-it, and its primary goal (and how it achieves it). This should be similar to
-your README description, though you can go into a little more detail here if you
-want.
+To offload to devices, host code and device code are compiled separately, then combined by the linker into a single fat binary.
-## Why do I want it?
+Support for offloading to GPUs from vendors other than Intel can be enabled by additionally installing plugins from Codeplay:
-Help your user know if your project will help them. Useful information can
-include:
+- [oneAPI for NVidia(R) GPUs](https://developer.codeplay.com/products/oneapi/nvidia/home/)
+- [oneAPI for AMD GPUs](https://developer.codeplay.com/products/oneapi/amd/home/)
-- **What is it good for?**: What types of problems does your project solve? What
- are the benefits of using it?
+Using these plugins and the appropriate compiler options, the fat binary for a DPC++ application will also include the necessary sections for offloading a kernel to NVidia(R) or AMD GPUs.
-- **What is it not good for?**: For example, point out situations that might
- intuitively seem suited for your project, but aren't for some reason. Also
- mention known limitations, scaling issues, or anything else that might let
- your users know if the project is not for them.
+A list of devices available to DPC++ can be obtained using ``sycl-ls``.
-- **What is it _not yet_ good for?**: Highlight any useful features that are
- coming soon.
+```bash
+sycl-ls
+[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2024.17.3.0.08_160000]
+[opencl:cpu:1] Intel(R) OpenCL, 12th Gen Intel(R) Core(TM) i9-12900 OpenCL 3.0 (Build 0) [2024.17.3.0.08_160000]
+[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce GT 1030 6.1 [CUDA 12.2]
+```
-## Where should I go next?
-
-Give your users next steps from the Overview. For example:
-
-- [Getting Started](/docs/getting-started/): Get started with $project
-- [Examples](/docs/examples/): Check out some example code!
+A list of SYCL projects can be found at https://sycl.tech/projects
diff --git a/content/en/docs/oneMKL.md b/content/en/docs/oneMKL.md
index 613e004..e97de98 100755
--- a/content/en/docs/oneMKL.md
+++ b/content/en/docs/oneMKL.md
@@ -1,15 +1,27 @@
---
title: oneMKL Python extension
description: A Python extension written using oneMKL interfaces.
-date: 2017-01-05
+date: 2024-07-02
weight: 4
---
-{{% pageinfo %}}
+Since `dpctl.tensor.usm_ndarray` is a Python object carrying a USM allocation, it is possible to write extensions that wrap `oneAPI Math Kernel Library Interfaces` ([oneMKL Interfaces](https://github.com/oneapi-src/oneMKL)) routines and call them from Python on the USM data underlying the `usm_ndarray` container.
-This is a placeholder page that shows you how to use this template site.
+For an example routine from the `oneMKL` documentation, take [`geqrf`](https://spec.oneapi.io/versions/latest/elements/oneMKL/source/domains/lapack/geqrf.html#geqrf-usm-version):
+```cpp
+namespace oneapi::mkl::lapack {
+    cl::sycl::event geqrf(cl::sycl::queue &queue,
+                          std::int64_t m,
+                          std::int64_t n,
+                          T *a,
+                          std::int64_t lda,
+                          T *tau,
+                          T *scratchpad,
+                          std::int64_t scratchpad_size,
+                          const std::vector<cl::sycl::event> &events = {});
+}
+```
-{{% /pageinfo %}}
+The `pybind11` casts discussed in the previous section enable us to write a simple wrapper function for this routine with `dpctl::tensor::usm_ndarray` inputs and outputs, so long as we take the same precautions to avoid deadlocks. As a result, we can write the extension in much the same way as the `kde_sycl_ext` extension in the previous chapter.
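+
+A hedged sketch of such a wrapper for real double-precision input (validation, layout handling, and error handling omitted; note that `geqrf` expects column-major storage, and ``geqrf_scratchpad_size`` is the companion query routine from the same oneMKL domain):
+
+```cpp
+std::pair<sycl::event, sycl::event>
+py_geqrf(dpctl::tensor::usm_ndarray a, dpctl::tensor::usm_ndarray tau,
+         const std::vector<sycl::event> &depends)
+{
+    sycl::queue exec_q = a.get_queue();
+    std::int64_t m = a.get_shape(0);
+    std::int64_t n = a.get_shape(1);
+
+    double *a_ptr = a.get_data<double>();
+    double *tau_ptr = tau.get_data<double>();
+
+    // query the scratchpad size required by geqrf, allocate it on the device
+    std::int64_t scratch_size =
+        oneapi::mkl::lapack::geqrf_scratchpad_size<double>(exec_q, m, n, m);
+    double *scratch = sycl::malloc_device<double>(scratch_size, exec_q);
+
+    sycl::event e_geqrf = oneapi::mkl::lapack::geqrf(
+        exec_q, m, n, a_ptr, m, tau_ptr, scratch, scratch_size, depends);
+
+    // free the scratchpad asynchronously, once geqrf completes
+    sycl::event e_cleanup = exec_q.submit([&](sycl::handler &cgh) {
+        cgh.depends_on(e_geqrf);
+        cgh.host_task([scratch, exec_q]() { sycl::free(scratch, exec_q); });
+    });
+
+    return {e_geqrf, e_cleanup};
+}
+```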
-Do you have any example **applications** or **code** for your users in your repo
-or elsewhere? Link to your examples here.
+An example of a Python extension, `"mkl_interface_ext"`, that uses `oneMKL` calls to implement QR decomposition can be found in the `"steps/mkl_interface"` folder (see [README](steps/mkl_interface/README.md)).
diff --git a/hugo.yaml b/hugo.yaml
index cd0957c..2a65c84 100644
--- a/hugo.yaml
+++ b/hugo.yaml
@@ -77,6 +77,10 @@ params:
# set taxonomyPageHeader = [] to hide taxonomies on the page headers
taxonomyPageHeader: []
+  katex:
+    enable: true
+
# First one is picked as the Twitter card image if not set on page.
# images: [images/project-illustration.png]