Skip to content

Commit 3d2dbd6

Browse files
Examples added (#197)
* Added Cython examples 1. Cython/sycl_direct_linkage Example of native extension "cdef import"-ing sycl C++ classes directly from CL/sycl.hpp Queue is created within the function, adding measurable overhead. Extension uses GEMV to compute column-wise total of a C-contiguous matrix, and illustrates linking to oneMKL. 2. Cython/sycl_bufer Example of native extension building on the above, but illustrating getting the queue from dpctl. 3. Cython/usm_memory Example of native extension allocating USM shared memory via dpctl, and using it as a buffer underlying NumPy array. Cython functions dispatches to a SYCL code that works with USM pointer. One function populates USM memory underneath NumPy array with random numbers using ``oneapi::mkl::rng::device`` function used in SYCL kernel, with random number being parameters of European vanilla options. The second function uses SYCL to price these options using Black-Scholes formula. * updated create_sycl_queues.py to run on current dpctl * Extended sycl_buffer example to implement column-wise summation without MKL * few examples illustrating MemoryUSM* objects * have it black's way * Adjusted examples section of global README.md
1 parent 17a1f88 commit 3d2dbd6

24 files changed

+1091
-8
lines changed

README.md

Lines changed: 8 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -76,11 +76,17 @@ Examples
7676
========
7777
See examples in folder `examples`.
7878

79-
Run examples:
79+
Run python examples:
8080
```bash
81-
python examples/create_sycl_queues.py
81+
for script in `ls examples/python/`; do echo "executing ${script}"; python examples/python/${script}; done
8282
```
8383

84+
Examples of building Cython extensions with DPC++ compiler, that interoperate with dpCtl can be found in
85+
folder `cython`.
86+
87+
Each example in `cython` folder can be built using `CC=clang CXX=dpcpp python setup.py build_ext --inplace`.
88+
Please refer to `run.py` script in respective folders to execute extensions.
89+
8490
Tests
8591
=====
8692
See tests in folder `dpctl/tests`.

examples/cython/sycl_buffer/README.md

Lines changed: 80 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,80 @@
1+
#1 Example of SYCL extension working NumPy array input via SYCL buffers
2+
3+
4+
#2 Decription
5+
6+
Cython function expecting a 2D array in C-contiguous layout that
7+
computes column-wise total by using SYCL oneMKL (as GEMV call with
8+
an all units vector).
9+
10+
Example illustrates compiling SYCL extension, linking to oneMKL.
11+
12+
13+
#2 Compiling
14+
15+
```
16+
# make sure oneAPI is activated, $ONEAPI_ROOT must be set
17+
CC=clang CXX=dpcpp python setup.py build_ext --inplace
18+
```
19+
20+
21+
#2 Running
22+
23+
```
24+
# SYCL_BE=PI_OPENCL sets SYCL backend to OpenCL to avoid a
25+
# transient issue with MKL's using the default Level-0 backend
26+
(idp) [08:16:12 ansatnuc04 simple]$ SYCL_BE=PI_OPENCL ipython
27+
Python 3.7.7 (default, Jul 14 2020, 22:02:37)
28+
Type 'copyright', 'credits' or 'license' for more information
29+
IPython 7.17.0 -- An enhanced Interactive Python. Type '?' for help.
30+
31+
In [1]: import syclbuffer as sb, numpy as np, dpctl
32+
33+
In [2]: x = np.random.randn(10**4, 2500)
34+
35+
In [3]: %time m1 = np.sum(x, axis=0)
36+
CPU times: user 22.3 ms, sys: 160 µs, total: 22.5 ms
37+
Wall time: 21.2 ms
38+
39+
In [4]: %time m = sb.columnwise_total(x) # first time is slower, due to JIT overhead
40+
CPU times: user 207 ms, sys: 36.1 ms, total: 243 ms
41+
Wall time: 248 ms
42+
43+
In [5]: %time m = sb.columnwise_total(x)
44+
CPU times: user 8.89 ms, sys: 4.12 ms, total: 13 ms
45+
Wall time: 12.4 ms
46+
47+
In [6]: %time m = sb.columnwise_total(x)
48+
CPU times: user 4.82 ms, sys: 8.06 ms, total: 12.9 ms
49+
Wall time: 12.3 ms
50+
```
51+
52+
Running bench.py:
53+
54+
```
55+
========== Executing warm-up ==========
56+
NumPy result: [1. 1. 1. ... 1. 1. 1.]
57+
SYCL(Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz) result: [1. 1. 1. ... 1. 1. 1.]
58+
SYCL(Intel(R) Gen9 HD Graphics NEO) result: [1. 1. 1. ... 1. 1. 1.]
59+
Times for 'opencl:cpu:0'
60+
[2.864787499012891, 2.690436460019555, 2.5902308400254697, 2.5802528870408423, 2.538990616973024]
61+
Times for 'opencl:gpu:0'
62+
[1.9769684099592268, 2.3491444009705447, 2.293720397981815, 2.391633405990433, 1.9465659779962152]
63+
Times for NumPy
64+
[3.4011058019823395, 3.07286038500024, 3.0390414349967614, 3.0305576199898496, 3.002687797998078]
65+
```
66+
67+
Running run.py:
68+
69+
```
70+
(idp) [09:14:53 ansatnuc04 sycl_buffer]$ SYCL_BE=PI_OPENCL python run.py
71+
Result computed by NumPy
72+
[ 0.27170187 -23.36798583 7.31326489 -1.95121928]
73+
Result computed by SYCL extension
74+
[ 0.27170187 -23.36798583 7.31326489 -1.95121928]
75+
76+
Running on: Intel(R) Gen9 HD Graphics NEO
77+
[ 0.27170187 -23.36798583 7.31326489 -1.95121928]
78+
Running on: Intel(R) Core(TM) i7-10710U CPU @ 1.10GHz
79+
[ 0.27170187 -23.36798583 7.31326489 -1.95121928]
80+
```
Lines changed: 28 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,28 @@
1+
cimport numpy as cnp
2+
import numpy as np
3+
4+
cimport dpctl as c_dpctl
5+
import dpctl
6+
7+
cdef extern from "use_sycl_buffer.h":
8+
int c_columnwise_total(c_dpctl.DPPLSyclQueueRef q, size_t n, size_t m, double *m, double *ct) nogil
9+
int c_columnwise_total_no_mkl(c_dpctl.DPPLSyclQueueRef q, size_t n, size_t m, double *m, double *ct) nogil
10+
11+
def columnwise_total(double[:, ::1] v, method='mkl'):
12+
cdef cnp.ndarray res_array = np.empty((v.shape[1],), dtype='d')
13+
cdef double[::1] res_memslice = res_array
14+
cdef int ret_status
15+
cdef c_dpctl.SyclQueue q
16+
cdef c_dpctl.DPPLSyclQueueRef q_ref
17+
18+
q = c_dpctl.get_current_queue()
19+
q_ref = q.get_queue_ref()
20+
21+
if method == 'mkl':
22+
with nogil:
23+
ret_status = c_columnwise_total(q_ref, v.shape[0], v.shape[1], &v[0,0], &res_memslice[0])
24+
else:
25+
with nogil:
26+
ret_status = c_columnwise_total_no_mkl(q_ref, v.shape[0], v.shape[1], &v[0,0], &res_memslice[0])
27+
28+
return res_array

examples/cython/sycl_buffer/bench.py

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,51 @@
1+
import dpctl
2+
import syclbuffer as sb
3+
import numpy as np
4+
5+
X = np.full((10 ** 4, 4098), 1e-4, dtype="d")
6+
7+
# warm-up
8+
print("=" * 10 + " Executing warm-up " + "=" * 10)
9+
print("NumPy result: ", X.sum(axis=0))
10+
11+
dpctl.set_default_queue("opencl", "cpu", 0)
12+
print(
13+
"SYCL({}) result: {}".format(
14+
dpctl.get_current_queue().get_sycl_device().get_device_name(),
15+
sb.columnwise_total(X),
16+
)
17+
)
18+
19+
dpctl.set_default_queue("opencl", "gpu", 0)
20+
print(
21+
"SYCL({}) result: {}".format(
22+
dpctl.get_current_queue().get_sycl_device().get_device_name(),
23+
sb.columnwise_total(X),
24+
)
25+
)
26+
27+
import timeit
28+
29+
print("Times for 'opencl:cpu:0'")
30+
print(
31+
timeit.repeat(
32+
stmt="sb.columnwise_total(X)",
33+
setup='dpctl.set_default_queue("opencl", "cpu", 0); '
34+
"sb.columnwise_total(X)", # ensure JIT compilation is not counted
35+
number=100,
36+
globals=globals(),
37+
)
38+
)
39+
40+
print("Times for 'opencl:gpu:0'")
41+
print(
42+
timeit.repeat(
43+
stmt="sb.columnwise_total(X)",
44+
setup='dpctl.set_default_queue("opencl", "gpu", 0); sb.columnwise_total(X)',
45+
number=100,
46+
globals=globals(),
47+
)
48+
)
49+
50+
print("Times for NumPy")
51+
print(timeit.repeat(stmt="X.sum(axis=0)", number=100, globals=globals()))

examples/cython/sycl_buffer/run.py

Lines changed: 22 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,22 @@
1+
import syclbuffer as sb
2+
import numpy as np
3+
4+
X = np.random.randn(100, 4)
5+
6+
print("Result computed by NumPy")
7+
print(X.sum(axis=0))
8+
print("Result computed by SYCL extension")
9+
print(sb.columnwise_total(X))
10+
11+
12+
print("")
13+
# controlling where to offload
14+
import dpctl
15+
16+
with dpctl.device_context("opencl:gpu"):
17+
print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
18+
print(sb.columnwise_total(X))
19+
20+
with dpctl.device_context("opencl:cpu"):
21+
print("Running on: ", dpctl.get_current_queue().get_sycl_device().get_device_name())
22+
print(sb.columnwise_total(X))

examples/cython/sycl_buffer/setup.py

Lines changed: 67 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,67 @@
1+
import sys
2+
from os.path import join, exists, abspath, dirname
3+
from os import getcwd
4+
from os import environ
5+
from Cython.Build import cythonize
6+
7+
8+
def configuration(parent_package="", top_path=None):
9+
from numpy.distutils.misc_util import Configuration
10+
from numpy.distutils.system_info import get_info
11+
import numpy as np
12+
import dpctl
13+
14+
config = Configuration("", parent_package, top_path)
15+
16+
oneapi_root = environ.get("ONEAPI_ROOT", None)
17+
if not oneapi_root:
18+
raise ValueError("ONEAPI_ROOT must be set, typical value is /opt/intel/oneapi")
19+
20+
mkl_info = {
21+
"include_dirs": [join(oneapi_root, "mkl", "include")],
22+
"library_dirs": [
23+
join(oneapi_root, "mkl", "lib"),
24+
join(oneapi_root, "mkl", "lib", "intel64"),
25+
],
26+
"libraries": [
27+
"mkl_sycl",
28+
"mkl_intel_ilp64",
29+
"mkl_tbb_thread",
30+
"mkl_core",
31+
"tbb",
32+
"iomp5",
33+
],
34+
}
35+
36+
mkl_include_dirs = mkl_info.get("include_dirs")
37+
mkl_library_dirs = mkl_info.get("library_dirs")
38+
mkl_libraries = mkl_info.get("libraries")
39+
40+
pdir = dirname(__file__)
41+
wdir = join(pdir)
42+
43+
eca = ["-Wall", "-Wextra", "-fsycl", "-fsycl-unnamed-lambda"]
44+
45+
config.add_extension(
46+
name="syclbuffer",
47+
sources=[
48+
join(pdir, "_buffer_example.pyx"),
49+
join(wdir, "use_sycl_buffer.cpp"),
50+
join(wdir, "use_sycl_buffer.h"),
51+
],
52+
include_dirs=[wdir, np.get_include(), dpctl.get_include()] + mkl_include_dirs,
53+
libraries=["sycl"] + mkl_libraries,
54+
runtime_library_dirs=mkl_library_dirs,
55+
extra_compile_args=eca, # + ['-O0', '-g', '-ggdb'],
56+
extra_link_args=["-fPIC"],
57+
language="c++",
58+
)
59+
60+
config.ext_modules = cythonize(config.ext_modules, include_path=[pdir, wdir])
61+
return config
62+
63+
64+
if __name__ == "__main__":
65+
from numpy.distutils.core import setup
66+
67+
setup(configuration=configuration)
Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,109 @@
1+
#include <CL/sycl.hpp>
2+
#include "use_sycl_buffer.h"
3+
#include <oneapi/mkl.hpp>
4+
#include "dppl_sycl_types.h"
5+
6+
int
7+
c_columnwise_total(DPPLSyclQueueRef q_ref, size_t n, size_t m, double *mat, double *ct) {
8+
9+
sycl::queue q = *(reinterpret_cast<sycl::queue *>(q_ref));
10+
11+
sycl::buffer<double, 1> mat_buffer = sycl::buffer(mat, sycl::range<1>(n * m));
12+
sycl::buffer<double, 1> ct_buffer = sycl::buffer(ct, sycl::range<1>(m));
13+
14+
double *ones = reinterpret_cast<double *>(malloc(n * sizeof(double)));
15+
{
16+
sycl::buffer<double, 1> ones_buffer = sycl::buffer(ones, sycl::range<1>(n));
17+
18+
try {
19+
auto ev = q.submit([&](sycl::handler &cgh) {
20+
auto ones_acc = ones_buffer.get_access<sycl::access::mode::read_write>(cgh);
21+
cgh.fill(ones_acc, double(1.0));
22+
});
23+
24+
ev.wait_and_throw();
25+
}
26+
catch (sycl::exception const& e) {
27+
std::cout << "\t\tCaught synchronous SYCL exception during fill:\n"
28+
<< e.what() << std::endl << "OpenCL status: " << e.get_cl_code() << std::endl;
29+
goto cleanup;
30+
}
31+
32+
try {
33+
oneapi::mkl::blas::row_major::gemv(
34+
q,
35+
oneapi::mkl::transpose::trans,
36+
n, m, double(1.0), mat_buffer, m,
37+
ones_buffer, 1,
38+
double(0.0), ct_buffer, 1);
39+
q.wait();
40+
}
41+
catch (sycl::exception const &e) {
42+
std::cout << "\t\tCaught synchronous SYCL exception during GEMV:\n"
43+
<< e.what() << std::endl << "OpenCL status: " << e.get_cl_code() << std::endl;
44+
goto cleanup;
45+
}
46+
}
47+
48+
free(ones);
49+
return 0;
50+
51+
cleanup:
52+
free(ones);
53+
return -1;
54+
}
55+
56+
inline size_t upper_multiple(size_t n, size_t wg) { return wg * ((n + wg - 1)/wg); }
57+
58+
int
59+
c_columnwise_total_no_mkl(DPPLSyclQueueRef q_ref, size_t n, size_t m, double *mat, double *ct) {
60+
61+
sycl::queue q = *(reinterpret_cast<sycl::queue *>(q_ref));
62+
63+
sycl::buffer<double, 2> mat_buffer = sycl::buffer(mat, sycl::range<2>(n, m));
64+
sycl::buffer<double, 1> ct_buffer = sycl::buffer(ct, sycl::range<1>(m));
65+
66+
auto e = q.submit(
67+
[&](sycl::handler &h) {
68+
sycl::accessor ct_acc {ct_buffer, h, sycl::write_only};
69+
h.parallel_for(
70+
sycl::range<1>(m),
71+
[=](sycl::id<1> i){
72+
ct_acc[i] = 0.0;
73+
});
74+
});
75+
76+
constexpr size_t wg = 256;
77+
auto e2 = q.submit(
78+
[&](sycl::handler &h) {
79+
80+
sycl::accessor mat_acc {mat_buffer, h, sycl::read_only};
81+
sycl::accessor ct_acc {ct_buffer, h};
82+
h.depends_on(e);
83+
84+
sycl::range<2> global {upper_multiple(n, wg), m};
85+
sycl::range<2> local {wg, 1};
86+
87+
h.parallel_for(
88+
sycl::nd_range<2>(global, local),
89+
[=](sycl::nd_item<2> it) {
90+
size_t i = it.get_global_id(0);
91+
size_t j = it.get_global_id(1);
92+
double group_sum = sycl::ONEAPI::reduce(
93+
it.get_group(),
94+
(i < n) ? mat_acc[it.get_global_id()] : 0.0,
95+
std::plus<double>()
96+
);
97+
if (it.get_local_id(0) == 0) {
98+
sycl::ONEAPI::atomic_ref<
99+
double,
100+
sycl::ONEAPI::memory_order::relaxed,
101+
sycl::ONEAPI::memory_scope::system,
102+
sycl::access::address_space::global_space>(ct_acc[j]) += group_sum;
103+
}
104+
});
105+
});
106+
107+
e2.wait_and_throw();
108+
return 0;
109+
}
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
#include <CL/sycl.hpp>
2+
#include "dppl_sycl_types.h"
3+
4+
extern int c_columnwise_total(
5+
DPPLSyclQueueRef q, size_t n, size_t m, double *mat, double *ct);
6+
extern int c_columnwise_total_no_mkl(
7+
DPPLSyclQueueRef q, size_t n, size_t m, double *mat, double *ct);

0 commit comments

Comments
 (0)