.. _heterogeneous_computing:
.. include:: ./ext_links.txt

Heterogeneous Computing
=======================

Device Offload
**************

Python is an interpreted language, which implies that most of a Python script runs on the CPU,
and only a few data-parallel regions execute on data-parallel devices.
That is why the concept of the host and offload devices is helpful when conceptualizing
a heterogeneous programming model in Python.

.. image:: ./_images/hetero-devices.png
   :width: 600px
   :align: center
   :alt: SIMD

The above diagram illustrates the *host* (the CPU that runs the Python interpreter) and three *devices*
(two GPU devices and one attached accelerator device). **Data Parallel Extensions for Python**
offer a programming model where a script executed by the Python interpreter on the host can *offload*
data-parallel kernels to a user-specified device. A *kernel* is the *data-parallel region* of a program
submitted for execution on the device. There can be multiple data-parallel regions, and hence multiple *offload kernels*.

Kernels can be pre-compiled into a library, such as ``dpnp``, or directly coded
in a programming language for heterogeneous computing, such as `OpenCl*`_ or `DPC++`_.
**Data Parallel Extensions for Python** offer a way of writing kernels directly in Python
using the `Numba*`_ compiler along with ``numba-dpex``, the `Data Parallel Extension for Numba*`_.
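
For illustration, below is a sketch of a vector-addition kernel written in the ``numba-dpex``
style. The decorator and launch syntax differ between ``numba-dpex`` releases, so treat this as
a schematic of the approach rather than the exact current API:

.. code-block:: python

    import dpnp as np
    import numba_dpex as dpex

    @dpex.kernel
    def vector_add(a, b, c):
        # Each work item computes one element of the result
        i = dpex.get_global_id(0)
        c[i] = a[i] + b[i]

    a = np.ones(1024, device="gpu")
    b = np.ones(1024, device="gpu")
    c = np.empty_like(a)

    # Launch one work item per element
    vector_add[dpex.Range(1024)](a, b, c)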

One or more kernels are submitted for execution into a *queue* targeting an *offload device*.
For each device, you can create one or more queues. In most cases, you do not need to work
with device queues directly. Data Parallel Extensions for Python do the necessary underlying
work with queues for you through the :ref:`Compute-Follows-Data`.

Unified Shared Memory
*********************

Each device has its own memory, which is not necessarily accessible from another device.

.. image:: ./_images/hetero-devices.png
   :width: 600px
   :align: center
   :alt: SIMD

For example, **Device 1** memory may not be directly accessible from the host, but only accessible
via expensive copying performed by driver software. Similarly, depending on the architecture, direct data
exchange between **Device 2** and **Device 1** may be impossible, and only possible via expensive
copying through the host memory. These aspects must be taken into consideration when programming
data-parallel devices.

In the illustration above, **Device 2** logically consists of two sub-devices: **Sub-Device 1**
and **Sub-Device 2**. The programming model allows accessing **Device 2** as a single logical device,
or working with each individual sub-device. In the former case a programmer needs to create
a queue for **Device 2**. In the latter case a programmer needs to create two queues, one for each sub-device.

The `SYCL*`_ standard introduces the concept of *Unified Shared Memory* (USM). USM requires hardware support
for a unified virtual address space, which allows coherency between host and device
pointers. All memory is allocated by the host, but USM offers three distinct allocation types,
illustrated by the sketch after this list:

* **Host: located on the host, accessible by the host or device.** This type of memory is useful
  when you need to stream read-only data from the host to the device once.

* **Device: located on the device, accessible by the device only.** The fastest type of memory.
  Useful when most of the data crunching happens on the device.

* **Shared: located on both the host and the device, accessible by the host and device.**
  Shared allocations are useful when both the host and device access data,
  since a user does not need to manage data migration explicitly.
  However, it is much slower than the USM Device memory type.
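
In ``dpnp``, the allocation type can be selected with the ``usm_type`` keyword of the array
creation functions. A minimal sketch, assuming at least one SYCL device is available:

.. code-block:: python

    import dpnp as np

    x_device = np.ones(1024, usm_type="device")  # fastest; accessible by the device only
    x_shared = np.ones(1024, usm_type="shared")  # accessible by both the host and device
    x_host = np.ones(1024, usm_type="host")      # host allocation the device can access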

Compute-Follows-Data
********************
Since data copying between devices is typically very expensive, for performance reasons it is essential
to process data close to where it is allocated. This is the premise of the *Compute-Follows-Data* programming model,
which states that the compute happens where the data resides. Tensors implemented in ``dpctl`` and ``dpnp``
carry information about allocation queues, and hence about the device on which an array is allocated.
Based on the tensor input arguments of an offload kernel, the queue on which the execution happens is deduced automatically.

.. image:: ./_images/kernel-queue-device.png
   :width: 600px
   :align: center
   :alt: SIMD

The picture above illustrates the *Compute-Follows-Data* concept. Arrays ``A`` and ``B`` are inputs to the
**Offload Kernel**. These arrays carry information about their *allocation queue* (**Device Queue**) and the
*device* (**Device 1**) where they were created. According to the Compute-Follows-Data paradigm
the **Offload Kernel** will be submitted to this **Device Queue**, and the resulting array ``C`` will
be created on the **Device Queue** associated with **Device 1**.
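
A minimal sketch of this behavior with ``dpnp``, assuming a GPU SYCL device is available
(the ``device`` keyword and the ``.device`` attribute follow the array-API conventions
implemented by ``dpnp``):

.. code-block:: python

    import dpnp as np

    # Both inputs are allocated on the canonical queue of the GPU device
    a = np.full(1024, 2.0, device="gpu")
    b = np.full(1024, 3.0, device="gpu")

    # The addition is offloaded to the inputs' queue;
    # c is allocated on that same queue
    c = a + b
    print(c.device)  # same device the inputs were allocated on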

**Data Parallel Extensions for Python** require all input tensor arguments to have the **same** allocation queue.
Otherwise, an exception is thrown. For example, the following usages will result in an exception.

.. figure:: ./_images/queue-exception1.png
   :width: 600px
   :align: center
   :alt: SIMD

   Input tensors are allocated on different devices. Exception is thrown.

.. figure:: ./_images/queue-exception2.png
   :width: 600px
   :align: center
   :alt: SIMD

   Input tensors are on the same device, but queues are different. Exception is thrown.
106107
107108.. figure :: ./_images/queue-exception3.png
108109 :width: 600px
@@ -111,19 +112,19 @@ otherwise an exception will be thrown. For example, the following usages will re

   Data belongs to the same device, but queues are different and associated with different sub-devices.
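
In code, the failure mode looks roughly like this. A sketch assuming both a CPU and a GPU
device are visible; the exact exception type depends on the library version:

.. code-block:: python

    import dpnp as np

    a = np.ones(1024, device="cpu")
    b = np.ones(1024, device="gpu")

    try:
        c = a + b  # inputs carry different allocation queues
    except Exception as exc:
        print(f"Cannot deduce execution queue: {type(exc).__name__}")

    # Remedy: copy one operand so that both inputs share an allocation queue
    c = a.to_device(b.device) + b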

Copying Data Between Devices and Queues
***************************************

**Data Parallel Extensions for Python** create **one** *canonical queue* per device. Normally,
you do not need to manage queues directly. Having one canonical queue per device
allows you to copy data between devices using the ``to_device()`` method:

.. code-block:: python

    a_new = a.to_device(b.device)

Array ``a`` is copied to the device associated with array ``b`` into the new array ``a_new``.
The same queue is associated with ``b`` and ``a_new``.

Alternatively, you can do this as follows:

.. code-block:: python

    a_new = dpctl.tensor.asarray(a, device=b.device)
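
Putting it together, a runnable sketch, assuming both a CPU and a GPU device are available;
the printed device of ``a_new`` matches that of ``b``:

.. code-block:: python

    import dpnp as np

    a = np.arange(10, device="cpu")
    b = np.arange(10, device="gpu")

    a_new = a.to_device(b.device)  # copy a to the device of b
    print(a_new.device)            # same device (and canonical queue) as b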

Creating Additional Queues
**************************

As noted above, **Data Parallel Extensions for Python** automatically create one canonical queue per device,
and you normally work with this queue implicitly. However, you can always create as many additional queues per device
as needed and work with them explicitly, for example, for profiling purposes.
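
For instance, a profiling-enabled queue can be created explicitly with ``dpctl`` and used for
allocations. A sketch, assuming a GPU device is available; the ``sycl_queue`` keyword ties an
array, and hence the kernels consuming it, to that queue:

.. code-block:: python

    import dpctl
    import dpnp as np

    # An explicit queue on the default GPU device with event profiling enabled
    q = dpctl.SyclQueue("gpu", property="enable_profiling")

    # Arrays allocated on q; kernels operating on them are submitted to q
    x = np.ones(1024, sycl_queue=q)
    y = 2 * x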

Read the `Data Parallel Control`_ documentation for more details about queues.