.. _programming_model:
.. include:: ./ext_links.txt

Programming Model
=================

In a heterogeneous system there may be **multiple** devices a Python user may
want to engage. For example, it is common for a consumer-grade laptop to feature
an integrated or a discrete GPU alongside a CPU.

To harness their power, one needs to know how to answer the following three key
questions:

1. How does a Python program recognize available computational devices?
2. How does a Python workload specify computations to be offloaded to selected
   devices?
3. How does a Python application manage data sharing?

Recognizing available devices
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Python package ``dpctl`` answers these questions. All the computational
devices known to the underlying DPC++ runtime can be accessed using
``dpctl.get_devices()``. A specific device of interest `can be selected
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/device_selection.html>`__
either by using a helper function, e.g. ``dpctl.select_gpu_device()``, or by
passing a filter selector string to the ``dpctl.SyclDevice`` constructor.

.. code:: python

    import dpctl

    # select a GPU device. If multiple GPU devices are present,
    # let the underlying runtime choose among them
    dev_gpu = dpctl.SyclDevice("gpu")
    # select a CPU device
    dev_cpu = dpctl.SyclDevice("cpu")

    # stand-alone function, equivalent to C++
    # `auto dev = sycl::gpu_selector().select_device();`
    dev_gpu_alt = dpctl.select_gpu_device()
    # stand-alone function, equivalent to C++
    # `auto dev = sycl::cpu_selector().select_device();`
    dev_cpu_alt = dpctl.select_cpu_device()

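
To enumerate everything the runtime can see, one may iterate over the list
returned by ``dpctl.get_devices()``; a minimal sketch:

.. code:: python

    import dpctl

    # print basic information about each device known to the DPC++ runtime
    for dev in dpctl.get_devices():
        print(dev)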

A `device object
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/devices.html>`__
can be used to query properties of the device, such as its name, vendor, maximum
number of compute units, memory size, etc.
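
A quick sketch of querying such properties (the attributes shown are a small,
illustrative subset of the ``dpctl.SyclDevice`` API):

.. code:: python

    # query descriptive properties of the selected device
    print(dev_gpu.name)               # device name
    print(dev_gpu.vendor)             # vendor string
    print(dev_gpu.max_compute_units)  # maximal number of compute units
    print(dev_gpu.global_mem_size)    # global memory size in bytes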

Specifying offload target
~~~~~~~~~~~~~~~~~~~~~~~~~

To answer the second question on the list, we first need a brief digression to
explain how offloading works in oneAPI DPC++.

.. note::
    In DPC++, a computation kernel can be specified using generic C++
    programming and then the kernel can be offloaded to any device that is
    supported by an underlying SYCL runtime. The device to which the kernel
    is offloaded is specified using an **execution queue** when *launching
    the kernel*.

    The oneAPI unified programming model brings portability across heterogeneous
    architectures. Another important aspect of the programming model is its
    inherent flexibility that makes it possible to go beyond portability and even
    strive for performance portability. A oneAPI library may be implemented
    using C++ techniques such as template metaprogramming or dynamic polymorphism
    to implement specializations for a generic kernel. If a kernel is implemented
    polymorphically, the specialized implementation will be dispatched based on
    the execution queue specified during kernel launch. The oneMKL library is an
    example of a performance-portable oneAPI library.

A computational task is offloaded for execution on a device by submitting it to
the DPC++ runtime, which inserts the task into a computational graph. Once the
device becomes available, the runtime selects a task whose dependencies are met
for execution. The computational graph as well as the device targeted by its
tasks are stored in a `SYCL queue
<https://intelpython.github.io/dpctl/latest/docfiles/user_guides/manual/dpctl/queues.html>`__
object. The task submission is therefore always associated with a queue.

Queues can be constructed directly from a device object, or by using a filter
selector string to indicate which device to target:

.. code:: python

    # construct queue from device object
    q1 = dpctl.SyclQueue(dev_gpu)
    # construct queue using filter selector
    q2 = dpctl.SyclQueue("gpu")

The computational tasks can be stored in a oneAPI native extension, in which
case their submission is orchestrated during Python API calls. Let’s consider a
function that offloads the evaluation of a polynomial at every point of a NumPy
array ``X``. Such a function needs to receive a queue object to indicate which
device the computation must be offloaded to:

.. code:: python

    # allocate space for the result
    Y = np.empty_like(X)
    # evaluate polynomial on the device targeted by the queue, Y[i] = p(X[i])
    oneapi_ext.offloaded_poly_evaluate(exec_q, X, Y)

A Python call to ``oneapi_ext.offloaded_poly_evaluate`` applied to NumPy arrays
of double precision floating-point numbers gets translated to the following
sample C++ code:

.. code:: cpp

    #include <sycl/sycl.hpp>

    void
    cpp_offloaded_poly_evaluate(
        sycl::queue q, const double *X, double *Y, size_t n) {
      // create buffers around the host allocations to make data accessible from device
      sycl::buffer<double, 1> buf_X(X, sycl::range<1>(n));
      sycl::buffer<double, 1> buf_Y(Y, sycl::range<1>(n));

      q.submit([&](sycl::handler &cgh) {
        // create buffer accessors indicating kernel data-flow pattern
        sycl::accessor acc_X(buf_X, cgh, sycl::read_only);
        sycl::accessor acc_Y(buf_Y, cgh, sycl::write_only, sycl::no_init);

        cgh.parallel_for(sycl::range<1>(n),
          // lambda function that gets executed by different work-items with
          // different arguments in parallel
          [=](sycl::id<1> id) {
            auto x = acc_X[id];
            acc_Y[id] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x));
          });
      }).wait();

      return;
    }

We refer the interested reader to the excellent and freely available “`Data
Parallel C++ <https://link.springer.com/book/10.1007%2F978-1-4842-5574-2>`__”
book for the details of this data-parallel C++ code.

Our package ``numba_dpex`` allows one to write such kernels directly in Python.

.. code:: python

    import numba_dpex


    @numba_dpex.kernel
    def numba_dpex_poly(X, Y):
        i = numba_dpex.get_global_id(0)
        x = X[i]
        Y[i] = 3.0 + x * (1.0 + x * (-0.5 + 0.3 * x))

Specifying the execution queue is done using a Python context manager:

.. code:: python

    import numpy as np

    X = np.random.randn(10**6)
    Y = np.empty_like(X)

    # q1 is the queue constructed earlier from the GPU device
    with dpctl.device_context(q1):
        # apply the kernel to elements of X, writing value into Y,
        # while executing using given queue
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The argument to ``device_context`` can be a queue object, a device object for
which a temporary queue will be created, or a filter selector string. Thus we
could have equally used ``dpctl.device_context(dev_gpu)`` or
``dpctl.device_context("gpu")``.
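
For instance, each of the following forms targets a GPU device, reusing objects
defined above (a sketch assuming a GPU is available):

.. code:: python

    with dpctl.device_context(q1):          # an existing queue object
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

    with dpctl.device_context(dev_gpu):     # a device object; a temporary queue is created
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

    with dpctl.device_context("gpu"):       # a filter selector string
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)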

Note that in these examples data sharing was implicitly managed for us: in the
case of calling a function from a precompiled oneAPI native extension, data
sharing was managed by the DPC++ runtime, while in the case of using a
``numba_dpex`` kernel it was managed during the execution of the kernel's
``__call__`` method.

Data sharing
~~~~~~~~~~~~

Implicit management of data is surely convenient, but its use in interpreted
code comes at a performance cost. For every Python API call, the runtime must
implicitly copy data from the host to the device before kernel execution
commences and then copy some (or all) of it back after the execution completes.

``dpctl`` provides for allocating memory directly accessible to kernels
executing on a device using SYCL’s Unified Shared Memory (`USM
<https://www.khronos.org/registry/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:usm>`__)
feature. It also implements the USM-based ND-array object
``dpctl.tensor.usm_ndarray`` that conforms to the `array API standard
<https://data-apis.org/array-api/latest/>`__.

.. code:: python

    import dpctl.tensor as dpt

    # allocate array of doubles using USM-device allocation on GPU device
    X = dpt.arange(0.0, stop=1.0, step=1e-6, device="gpu", usm_type="device")
    # allocate array for the output
    Y = dpt.empty_like(X)

    # execution queue is inferred from the allocation queues.
    # The kernel is executed on the same device where the arrays were allocated
    numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)

The execution queue can be unambiguously determined in this case since both
arguments are USM arrays with the same allocation queues and ``X.sycl_queue ==
Y.sycl_queue`` evaluates to ``True``. Should the allocation queues be different,
such an inference becomes ambiguous and ``numba_dpex`` raises
``IndeterminateExecutionQueueError``, advising the user to explicitly migrate
the data.

Migration can be accomplished either by using ``dpctl.tensor.asarray(X,
device=target_device)`` to create a copy, or by using the
``X.to_device(target_device)`` method.

A USM array can be copied back into a NumPy array using ``dpt.asnumpy(Y)`` if
needed.
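
For illustration, assuming ``X`` and ``Y`` are the USM arrays allocated above,
migration to a CPU device (an illustrative target) and the copy back to NumPy
may look as follows:

.. code:: python

    # migrate X to a CPU device by creating a copy
    X_cpu = dpt.asarray(X, device="cpu")
    # alternatively, migrate using the to_device method
    X_cpu_alt = X.to_device("cpu")

    # copy the result back into a NumPy array on the host
    Y_np = dpt.asnumpy(Y)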

Compute follows data
~~~~~~~~~~~~~~~~~~~~

Automatic deduction of the execution queue from allocation queues is consistent
with “`local control for data allocation target
<https://data-apis.org/array-api/latest/design_topics/device_support.html>`__”
in the array API standard. The user has full control over memory allocation
through three keyword arguments present in all `array creation functions
<https://data-apis.org/array-api/latest/API_specification/creation_functions.html>`__:
``device``, ``usm_type``, and ``sycl_queue``. For example, one might write the
following (the shapes and values are illustrative):

.. code:: python

    import dpctl
    import dpctl.tensor as dpt

    # allocate using a filter selector string and USM-device memory
    X = dpt.zeros(2048, dtype="float64", device="gpu", usm_type="device")

    # allocate on the device of an existing queue, using USM-shared memory
    q = dpctl.SyclQueue("cpu")
    Y = dpt.ones(2048, dtype="float64", sycl_queue=q, usm_type="shared")

The keyword ``device`` is `mandated by the array API
<https://data-apis.org/array-api/latest/design_topics/device_support.html#syntax-for-device-assignment>`__.
In ``dpctl.tensor`` the allowed values of the keyword are

- Filter selector string, e.g. ``device="gpu:0"``
- Existing ``dpctl.SyclDevice`` object, e.g. ``device=dev_gpu``
- Existing ``dpctl.SyclQueue`` object
- ``dpctl.tensor.Device`` object instance obtained from an existing USM array,
  e.g. ``device=X.device``

In all cases, an allocation queue object will be constructed as described
`earlier <#specifying-offload-target>`__ and stored in the array instance,
accessible with ``X.sycl_queue``. Instead of using the ``device`` keyword, one
can alternatively use the ``sycl_queue`` keyword to directly specify a
``dpctl.SyclQueue`` object to be used as the allocation queue, which may improve
readability.
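
As a sketch (the shapes and the ``"gpu"`` selector are illustrative), allocating
with an explicit queue and then with the ``Device`` object of the resulting
array places both arrays on the same allocation queue:

.. code:: python

    import dpctl
    import dpctl.tensor as dpt

    q = dpctl.SyclQueue("gpu")
    A = dpt.empty(1024, dtype="float64", sycl_queue=q)
    # reuse the allocation queue of A via its Device object
    B = dpt.empty(1024, dtype="float64", device=A.device)
    assert A.sycl_queue == B.sycl_queue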

The rationale for storing the allocation queue in the array is that kernels
submitted to this queue are guaranteed to be able to correctly dereference (i.e.
access) the USM pointer. Array operations that only involve this single USM
array can thus execute on the allocation queue, and the output array can be
allocated on this same allocation queue with the same USM type as the input
array.

.. note::
    Reusing the allocation queue of the input array ensures that the
    computational tasks behind the API call can access the array without making
    implicit copies, and that the output array is allocated on the same device
    as the input.

Compute follows data is the rule prescribing the deduction of the execution
queue, the allocation queue, and the USM type of the result when multiple USM
arrays are combined. It stipulates that arrays can be combined if and only if
their allocation *queues are the same* as measured by the ``==`` operator (i.e.
``X.sycl_queue == Y.sycl_queue`` must evaluate to ``True``). Equal queues refer
to the same underlying task graph and DPC++ scheduler.

An attempt to combine USM arrays with different allocation queues raises an
exception advising the user to migrate the data. Migration can be accomplished
either by using ``dpctl.tensor.asarray(X, device=Y.device)`` to create a copy,
or by using the ``X.to_device(Y.device)`` method, which can sometimes perform
the migration more efficiently.
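
A minimal sketch of this pattern, reusing the ``numba_dpex_poly`` kernel defined
earlier (the control flow shown is illustrative rather than a prescribed API):

.. code:: python

    import dpctl.tensor as dpt

    # combine X and Y only if compute follows data allows it
    if X.sycl_queue == Y.sycl_queue:
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)
    else:
        # migrate Y onto the allocation queue of X before combining
        Y = dpt.asarray(Y, device=X.device)
        numba_dpex_poly[numba_dpex.Range(X.size)](X, Y)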

.. warning::
    ``dpctl`` and ``numba_dpex`` are both under heavy development. Feel free to
    file an issue on GitHub or reach out on Gitter should you encounter any
    issues.