
Commit 2780370

Merge pull request #34 from IntelPython/doc_update
Documentation improvements according to Tech Writer team suggestions
2 parents d1c1643 + d8e9abb commit 2780370

13 files changed: +147 -2647 lines changed

docs/sources/heterogeneous_computing.rst

Lines changed: 35 additions & 36 deletions
@@ -1,15 +1,15 @@
 .. _heterogeneous_computing:
 .. include:: ./ext_links.txt

-Heterogeneous computing
+Heterogeneous Computing
 =======================

 Device Offload
 **************

-Python is an interpreted language, which implies that most of Python codes will run on CPU,
+Python is an interpreted language, which implies that most of the Python script will run on CPU,
 and only a few data parallel regions will execute on data parallel devices.
-That is why the concept of host and offload devices is useful when it comes to conceptualizing
+That is why the concept of the host and offload devices is helpful when it comes to conceptualizing
 a heterogeneous programming model in Python.

 .. image:: ./_images/hetero-devices.png
@@ -19,76 +19,77 @@ a heterogeneous programming model in Python.

 The above diagram illustrates the *host* (the CPU which runs Python interpreter) and three *devices*
 (two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python**
-offer a programming model where a script executed by Python interpreter on host can *offload* data
-parallel kernels to user-specified device. A *kernel* is the *data parallel region* of a program submitted
-for execution on the device. There can be multiple data parallel regions, and hence multiple *offload kernels*.
+offer a programming model where a script executed by Python interpreter on the host can *offload* data-parallel
+kernels to a user-specified device. A *kernel* is the *data-parallel region* of a program submitted
+for execution on the device. There can be multiple data-parallel regions, and hence multiple *offload kernels*.

-Kernels can be pre-compiled into a library, such as ``dpnp``, or, alternatively, directly coded
+Kernels can be pre-compiled into a library, such as ``dpnp``, or directly coded
 in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_ .
 **Data Parallel Extensions for Python** offer the way of writing kernels directly in Python
 using `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_.

 One or more kernels are submitted for execution into a *queue* targeting an *offload device*.
-For each device one or more queues can be created. In most cases you won’t need to work
+For each device, you can create one or more queues. In most cases, you do not need to work
 with device queues directly. Data Parallel Extensions for Python will do necessary underlying
 work with queues for you through the :ref:`Compute-Follows-Data`.

 Unified Shared Memory
 *********************

-Each device has its own memory, not necessarily accessible from another device.
+Each device has its memory, not necessarily accessible from another device.

 .. image:: ./_images/hetero-devices.png
    :width: 600px
    :align: center
    :alt: SIMD

-For example, **Device 1** memory may not be directly accessible from the host, but only accessible
+For example, **Device 1** memory may not be directly accessible from the host but accessible
 via expensive copying by a driver software. Similarly, depending on the architecture, direct data
-exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive
+exchange between **Device 2** and **Device 1** may be only possible via expensive
 copying through the host memory. These aspects must be taken into consideration when programming
 data parallel devices.

-In the above illustration the **Device 2** logically consists of two sub-devices, **Sub-Device 1**
+In the illustration above, **Device 2** logically consists of two sub-devices: **Sub-Device 1**
 and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device, or
 by working with each individual sub-devices. For the former case a programmer needs to create
 a queue for **Device 2**. For the latter case a programmer needs to create 2 queues, one for each sub-device.

 `SYCL*`_ standard introduces a concept of the *Unified Shared Memory* (USM). USM requires hardware support
 for unified virtual address space, which allows coherency between the host and the device
-pointers. All memory is allocated by the host, but it offers three distinct allocation types:
+pointers. The host allocates all memory, but offers three distinct allocation types:

 * **Host: located on the host, accessible by the host or device.** This type of memory is useful in a situation
-  when you need to stream a read-only data from the host to the device once.
+  when you need to stream read-only data from the host to the device once.

-* **Device: located on the device, accessibly only by the device.** This type of memory is the fastest one.
-  Useful in a situation when most of data crunching happens on the device.
+* **Device: located on the device, accessible by device only.** The fastest type of memory.
+  Useful in a situation when most of the data crunching happens on the device.

-* **Shared: location is both host and device (copies are synchronized by underlying software), accessible by
-  the host or device.** Shared allocations are useful when data are accessed by both host and devices,
-  since a user does not need to explicitly manage data migration. However, it is much slower than USM Device memory type.
+* **Shared: location is both host and device, accessible by the host and device.**
+  Shared allocations are useful when both host and device access data,
+  since a user does not need to manage data migration explicitly.
+  However, it is much slower than the USM Device memory type.

 Compute-Follows-Data
 ********************
 Since data copying between devices is typically very expensive, for performance reasons it is essential
 to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model,
-which states that the compute will happen where the data resides. Tensors implemented in ``dpctl`` and ``dpnp``
+which states that the compute happens where the data resides. Tensors implemented in ``dpctl`` and ``dpnp``
 carry information about allocation queues, and hence, about the device on which an array is allocated.
-Based on tensor input arguments of the offload kernel, it deduces the queue on which the execution takes place.
+Based on tensor input arguments of the offload kernel, it deduces the queue on which the execution happens.

 .. image:: ./_images/kernel-queue-device.png
    :width: 600px
    :align: center
    :alt: SIMD

-The above picture illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the
+The picture above illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the
 **Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the
 *device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm
 the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will
 be created on the **Device Queue** associated with the **Device 1**.

-**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue,
-otherwise an exception will be thrown. For example, the following usages will result in the exception.
+**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue.
+Otherwise, an exception is thrown. For example, the following usages will result in an exception.

 .. figure:: ./_images/queue-exception1.png
    :width: 600px
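A minimal sketch (not part of this diff) of the Compute-Follows-Data behavior described in the hunk above, assuming ``dpnp`` is installed and a GPU is visible to the SYCL runtime; the ``device="gpu"`` filter string is an illustrative assumption:

.. code-block:: python

   import dpnp as np

   # Both inputs are allocated on the GPU's canonical queue
   # ("gpu" is a SYCL filter selector string).
   x = np.arange(1_000, device="gpu")
   y = np.ones(1_000, device="gpu")

   # Compute-Follows-Data: the kernel is submitted to the queue
   # carried by x and y, and z is allocated on that same queue.
   z = x + y
   print(z.device)

   # Mixing allocation queues violates the rule described above:
   # w = np.arange(1_000, device="cpu")
   # z = x + w   # raises an exception (exact type depends on the version)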
@@ -102,7 +103,7 @@ otherwise an exception will be thrown. For example, the following usages will re
    :align: center
    :alt: SIMD

-Input tensors are on the same device but queues are different. Exception is thrown.
+Input tensors are on the same device, but queues are different. Exception is thrown.

 .. figure:: ./_images/queue-exception3.png
    :width: 600px
@@ -111,19 +112,19 @@ otherwise an exception will be thrown. For example, the following usages will re

 Data belongs to the same device, but queues are different and associated with different sub-devices.

-Copying data between devices and queues
+Copying Data Between Devices and Queues
 ***************************************

-**Data Parallel Extensions for Python** create **one** *canonical queue* per device so that in
-normal circumstances you do not need to directly manage queues. Having one canonical queue per device
-allows you to copy data between devices using to_device() method:
+**Data Parallel Extensions for Python** create **one** *canonical queue* per device. Normally,
+you do not need to directly manage queues. Having one canonical queue per device
+allows you to copy data between devices using the ``to_device()`` method:

 .. code-block:: python

    a_new = a.to_device(b.device)

-Array ``a`` will be copied to the device associated with array ``b`` into the new array ``a_new``.
-The same queue will be associated with ``b`` and ``a_new``.
+Array ``a`` is copied to the device associated with array ``b`` into the new array ``a_new``.
+The same queue is associated with ``b`` and ``a_new``.

 Alternatively, you can do this as follows:

@@ -137,13 +138,11 @@ Alternatively, you can do this as follows:

    a_new = dpctl.tensor.asarray(a, device=b.device)

-Creating additional queues
+Creating Additional Queues
 **************************

-As previously indicated **Data Parallel Extensions for Python** automatically create one canonical queue per device,
+As mentioned earlier, **Data Parallel Extensions for Python** automatically create one canonical queue per device,
 and you normally work with this queue implicitly. However, you can always create as many additional queues per device
-as needed, and work with them explicitly.
+as needed and work explicitly with them, for example, for profiling purposes.

-A typical situation when you will want to create the queue explicitly is for profiling purposes.
 Read `Data Parallel Control`_ documentation for more details about queues.
-
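The section above defers the explicit-queue workflow to the `Data Parallel Control`_ documentation; a hedged sketch (not part of this commit, assuming ``dpctl`` and ``dpnp`` are installed and a GPU is present; ``enable_profiling`` is a documented ``SyclQueue`` property) might look like this:

.. code-block:: python

   import dpctl
   import dpnp as np

   # An additional, explicitly created queue with profiling enabled,
   # instead of the canonical queue created per device.
   q = dpctl.SyclQueue("gpu", property="enable_profiling")

   # Arrays bound to the explicit queue via the sycl_queue keyword;
   # kernels operating on a and b are then submitted to q.
   a = np.arange(1_000, sycl_queue=q)
   b = np.ones(1_000, sycl_queue=q)
   c = a + b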
docs/sources/index.rst

Lines changed: 3 additions & 3 deletions
@@ -10,13 +10,13 @@ Data Parallel Extensions for Python
 ===================================

 Data Parallel Extensions for Python* extend numerical Python capabilities beyond CPU and allow even higher performance
-gains on data parallel devices such as GPUs. It consists of three foundational packages:
+gains on data parallel devices, such as GPUs. It consists of three foundational packages:

 * **dpnp** - Data Parallel Extensions for `Numpy*`_ - a library that implements a subset of
   Numpy that can be executed on any data parallel device. The subset is a drop-in replacement
   of core Numpy functions and numerical data types.
-* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - extension for Numba compiler
-  that enables programming data parallel devices the same way you program CPU with Numba.
+* **numba_dpex** - Data Parallel Extensions for `Numba*`_ - an extension for Numba compiler
+  that lets you program data-parallel devices as you program CPU with Numba.
 * **dpctl - Data Parallel Control library** that provides utilities for device selection,
   allocation of data on devices, tensor data structure along with `Python* Array API Standard`_ implementation, and support for creation of user-defined data-parallel extensions.

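The "drop-in replacement" claim in the ``dpnp`` bullet above can be illustrated with a short sketch that is not part of this commit; only the import changes relative to a plain NumPy script:

.. code-block:: python

   import dpnp as np  # drop-in for: import numpy as np

   # The same NumPy-style calls now execute on the default
   # data-parallel device selected by the SYCL runtime.
   x = np.linspace(0.0, 1.0, num=10)
   print(x.sum(), x.device)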
docs/sources/parallelism.rst

Lines changed: 12 additions & 12 deletions
@@ -1,40 +1,40 @@
 .. _parallelism:
 .. include:: ./ext_links.txt

-Parallelism in modern data parallel architectures
+Parallelism in Modern Data-Parallel Architectures
 =================================================

 Python is loved for its productivity and interactivity. But when it comes to dealing with
-computationally heavy codes Python performance cannot be compromised. Intel and Python numerical
-computing communities, such as `NumFOCUS <https://numfocus.org/>`_, dedicated attention to
+computationally heavy codes, Python performance cannot be compromised. Intel and Python numerical
+computing communities, such as `NumFOCUS <https://numfocus.org/>`_, dedicate attention to
 optimizing core numerical and data science packages for leveraging parallelism available in modern CPUs:

-* **Multiple computational cores:** Several computational cores allow processing data concurrently.
-  Compared to a single core CPU, *N* cores can process either *N* times bigger data in a fixed time, or
-  reduce a computation time *N* times for a fixed amount of data.
+* **Multiple computational cores:** Several computational cores allow processing the data concurrently.
+  Compared to a single-core CPU, *N* cores can process either *N* times bigger data in a fixed time, or
+  reduce a computation time *N* times for a set amount of data.

 .. image:: ./_images/dpep-cores.png
    :width: 600px
    :align: center
    :alt: Multiple CPU Cores

 * **SIMD parallelism:** SIMD (Single Instruction Multiple Data) is a special type of instructions
-  that perform operations on vectors of data elements at the same time. The size of vectors is called SIMD width.
-  If SIMD width is *K* then a SIMD instruction can process *K* data elements in parallel.
+  that perform operations on vectors of data elements at the same time. The size of vectors is called the SIMD width.
+  If a SIMD width is *K* then a SIMD instruction can process *K* data elements in parallel.

-  In the following diagram the SIMD width is 2, which means that a single instruction processes two elements simultaneously.
+  In the following diagram, the SIMD width is 2, which means that a single instruction processes two elements simultaneously.
   Compared to regular instructions that process one element at a time, 2-wide SIMD instruction performs
-  2 times more data in fixed time, or, respectively, process a fixed amount of data 2 times faster.
+  two times more data in fixed time, or, respectively, process a fixed amount of data two times faster.

 .. image:: ./_images/dpep-simd.png
    :width: 150px
    :align: center
    :alt: SIMD

 * **Instruction-Level Parallelism:** Modern CISC architectures, such as x86, allow performing data independent
-  instructions in parallel. In the following example, we compute :math:`a * b + (c - d)`.
+  instructions in parallel. In the following example, see how to compute :math:`a * b + (c - d)`.
   Operations :math:`*` and :math:`-` can be executed in parallel, the last instruction
-  :math:`+` depends on availability of :math:`a * b` and :math:`c - d` and hence cannot be executed in parallel
+  :math:`+` depends on availability of :math:`a * b` and :math:`c - d` and cannot be executed in parallel
   with :math:`*` and :math:`-`.

 .. image:: ./_images/dpep-ilp.png
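The cores and SIMD discussion above is conceptual; as an editorial sketch (not from the diff) of how this parallelism surfaces in the Python stack, stock CPU Numba spreads ``prange`` iterations across cores while the compiler vectorizes each chunk:

.. code-block:: python

   import numpy as np
   from numba import njit, prange

   @njit(parallel=True, fastmath=True)
   def axpy(a, x, y):
       out = np.empty_like(x)
       # prange distributes iterations across CPU cores; within each
       # chunk the compiler is free to emit SIMD instructions.
       for i in prange(x.shape[0]):
           out[i] = a * x[i] + y[i]
       return out

   x = np.random.rand(1_000_000)
   y = np.random.rand(1_000_000)
   print(axpy(2.0, x, y)[:3])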

docs/sources/prerequisites_and_installation.rst

Lines changed: 15 additions & 15 deletions
@@ -5,15 +5,15 @@

 .. |trade| unicode:: U+2122

-Prerequisites and installation
+Prerequisites and Installation
 ==============================

-1. Device drivers
+1. Device Drivers
 ******************

-Since you are about to start programming data parallel devices beyond CPU, you will need an appropriate hardware.
-For example, Data Parallel Extensions for Python work fine on Intel laptops with integrated graphics.
-In majority of cases your laptop already has all necessary device drivers installed. But if you want the most
+To start programming data parallel devices beyond CPU, you will need appropriate hardware.
+For example, Data Parallel Extensions for Python work fine on Intel |copy| laptops with integrated graphics.
+In the majority of cases, your Windows*-based laptop already has all necessary device drivers installed. But if you want the most
 up-to-date driver, you can always
 `update it to the latest one <https://www.intel.com/content/www/us/en/download-center/home.html>`_.
 Follow device driver installation instructions
@@ -22,29 +22,29 @@ to complete this step.
 All other necessary components for programming data parallel devices will be installed with
 Data Parallel Extensions for Python.

-2. Python interpreter
+2. Python Interpreter
 **********************

 You will need Python 3.8, 3.9, or 3.10 installed on your system. If you do not have one yet the easiest way to do
 that is to install `Intel Distribution for Python*`_.
-It will install all essential Python numerical and machine
-learning packages optimized for Intel hardware, including Data Parallel Extensions for Python*.
+It installs all essential Python numerical and machine
+learning packages optimized for the Intel hardware, including Data Parallel Extensions for Python*.
 If you have Python installation from another vendor, it is fine too. All you need is to install Data Parallel
-Extensions for Python manually.
+Extensions for Python manually as shown in the next section.

 3. Data Parallel Extensions for Python
 ***************************************

-You can skip this step if you already installed Intel |copy| Distribution for Python or Intel |copy| AI Analytics Toolkit.
+Skip this step if you already installed Intel |copy| Distribution for Python.

 The easiest way to install Data Parallel Extensions for Python is to install numba-dpex:

-Conda: ``conda install numba-dpex``
+* Conda: ``conda install numba-dpex``

-Pip: ``pip install numba-dpex``
+* Pip: ``pip install numba-dpex``

-The above commands will install ``numba-dpex`` along with its dependencies, including ``dpnp``, ``dpctl``,
-and required compiler runtimes and drivers.
+These commands install ``numba-dpex`` along with its dependencies, including ``dpnp``, ``dpctl``,
+and required compiler runtimes.

 .. WARNING::
-    Before installing with conda or pip it is strongly advised to update ``conda`` and ``pip`` to latest versions
+   Before installing with conda or pip it is strongly advised to update ``conda`` and ``pip`` to latest versions
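A quick post-install check (an editorial sketch, not part of this commit): enumerating what the SYCL runtime can see confirms that the drivers and packages from this page are wired up.

.. code-block:: python

   import dpctl

   # Print the SYCL platforms (drivers) visible to the runtime.
   dpctl.lsplatform()

   # A non-empty device list confirms a usable installation.
   for dev in dpctl.get_devices():
       print(dev)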
