22 changes: 11 additions & 11 deletions src/blas/backends/cublas/cublas_batch.cpp
@@ -16,9 +16,9 @@
* limitations under the License.
*
**************************************************************************/
-#include <CL/sycl/detail/pi.hpp>
#include "cublas_helper.hpp"
#include "cublas_scope_handle.hpp"
+#include "cublas_task.hpp"

#include "oneapi/mkl/exceptions.hpp"
#include "oneapi/mkl/blas/detail/cublas/onemkl_blas_cublas.hpp"

@@ -42,12 +42,12 @@ inline void gemm_batch(Func func, cl::sycl::queue &queue, transpose transa, tran
auto a_acc = a.template get_access<cl::sycl::access::mode::read>(cgh);
auto b_acc = b.template get_access<cl::sycl::access::mode::read>(cgh);
auto c_acc = c.template get_access<cl::sycl::access::mode::read_write>(cgh);
-cgh.interop_task([=](cl::sycl::interop_handler ih) {
-auto sc = CublasScopedContextHandler(queue);
+onemkl_cublas_host_task(cgh, queue,[=](CublasScopedContextHandler sc) {

Does sc have to be passed in by copy or would non-const reference make more sense? In general I imagine that this object would contain state, so passing in by reference might be more convenient and/or more performant.

Author

Thanks, indeed sc can be passed as a reference.
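
To illustrate the by-reference suggestion above, here is a minimal sketch that does not depend on SYCL or cuBLAS. `ScopedHandlerMock` and `run_with_scoped_handler` are hypothetical stand-ins for `CublasScopedContextHandler` and `onemkl_cublas_host_task`; they are assumptions made for this example and are not part of the PR.

```cpp
#include <cassert>

// Hypothetical stand-in for CublasScopedContextHandler (not the real class).
// Copying is deleted so that a callable taking it by value fails to compile.
struct ScopedHandlerMock {
    int handle_requests = 0;

    ScopedHandlerMock() = default;
    ScopedHandlerMock(const ScopedHandlerMock &) = delete;
    ScopedHandlerMock &operator=(const ScopedHandlerMock &) = delete;

    // Counts how often a handle was requested, standing in for real state.
    int get_handle() { return ++handle_requests; }
};

// Hypothetical stand-in for onemkl_cublas_host_task: builds the scoped object
// once and lends it to the callable by non-const reference.
template <typename F>
int run_with_scoped_handler(F f) {
    ScopedHandlerMock sc;
    f(sc);
    return sc.handle_requests; // state changes made by f are visible here
}
```

A call such as `run_with_scoped_handler([](ScopedHandlerMock &sc) { sc.get_handle(); })` compiles, whereas a lambda taking `ScopedHandlerMock` by value would be rejected because the copy constructor is deleted.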

auto handle = sc.get_handle(queue);
-auto a_ = sc.get_mem<cuDataType *>(ih, a_acc);
-auto b_ = sc.get_mem<cuDataType *>(ih, b_acc);
-auto c_ = sc.get_mem<cuDataType *>(ih, c_acc);

+auto a_ = sc.get_mem<cuDataType *>(a_acc);
+auto b_ = sc.get_mem<cuDataType *>(b_acc);
+auto c_ = sc.get_mem<cuDataType *>(c_acc);
cublasStatus_t err;
CUBLAS_ERROR_FUNC(func, err, handle, get_cublas_operation(transa),
get_cublas_operation(transb), m, n, k, (cuDataType *)&alpha, a_, lda,
@@ -122,9 +122,9 @@ inline cl::sycl::event gemm_batch(Func func, cl::sycl::queue &queue, transpose t
for (int64_t i = 0; i < num_events; i++) {
cgh.depends_on(dependencies[i]);
}
-cgh.interop_task([=](cl::sycl::interop_handler ih) {
-auto sc = CublasScopedContextHandler(queue);
+onemkl_cublas_host_task(cgh, queue,[=](CublasScopedContextHandler sc) {
auto handle = sc.get_handle(queue);

auto a_ = reinterpret_cast<const cuDataType *>(a);
auto b_ = reinterpret_cast<const cuDataType *>(b);
auto c_ = reinterpret_cast<cuDataType *>(c);
@@ -170,9 +170,9 @@ inline cl::sycl::event gemm_batch(Func func, cl::sycl::queue &queue, transpose *
for (int64_t i = 0; i < num_events; i++) {
cgh.depends_on(dependencies[i]);
}
-cgh.interop_task([=](cl::sycl::interop_handler ih) {
-auto sc = CublasScopedContextHandler(queue);
+onemkl_cublas_host_task(cgh, queue,[=](CublasScopedContextHandler sc) {
auto handle = sc.get_handle(queue);

int64_t offset = 0;
cublasStatus_t err;
for (int64_t i = 0; i < group_count; i++) {
4 changes: 2 additions & 2 deletions src/blas/backends/cublas/cublas_extensions.cpp
@@ -16,9 +16,9 @@
* limitations under the License.
*
**************************************************************************/
-#include <CL/sycl/detail/pi.hpp>
#include "cublas_helper.hpp"
#include "cublas_scope_handle.hpp"
+#include "cublas_task.hpp"

#include "oneapi/mkl/exceptions.hpp"
#include "oneapi/mkl/blas/detail/cublas/onemkl_blas_cublas.hpp"

190 changes: 97 additions & 93 deletions src/blas/backends/cublas/cublas_level1.cpp

Large diffs are not rendered by default.

306 changes: 153 additions & 153 deletions src/blas/backends/cublas/cublas_level2.cpp

Large diffs are not rendered by default.

131 changes: 65 additions & 66 deletions src/blas/backends/cublas/cublas_level3.cpp

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion src/blas/backends/cublas/cublas_scope_handle.cpp
@@ -48,7 +48,7 @@ cublas_handle::~cublas_handle() noexcept(false) {
*/
thread_local cublas_handle CublasScopedContextHandler::handle_helper = cublas_handle{};

-CublasScopedContextHandler::CublasScopedContextHandler(cl::sycl::queue queue) {
+CublasScopedContextHandler::CublasScopedContextHandler(cl::sycl::queue queue, cl::sycl::interop_handler& ih): ih(ih){
placedContext_ = queue.get_context();
auto device = queue.get_device();
auto desired = cl::sycl::get_native<cl::sycl::backend::cuda>(placedContext_);
5 changes: 3 additions & 2 deletions src/blas/backends/cublas/cublas_scope_handle.hpp
@@ -68,12 +68,13 @@ class CublasScopedContextHandler {
CUcontext original_;
cl::sycl::context placedContext_;
bool needToRecover_;
+cl::sycl::interop_handler& ih;

How will this work if there's no interop_handler in hipSYCL?

Author

The idea was to ifdef the headers in cublas_task.hpp, such that this header is not even included in case we are compiling with hipSYCL.

see: sbalint98#2

static thread_local cublas_handle handle_helper;
CUstream get_stream(const cl::sycl::queue &queue);
cl::sycl::context get_context(const cl::sycl::queue &queue);

public:
-CublasScopedContextHandler(cl::sycl::queue queue);
+CublasScopedContextHandler(cl::sycl::queue queue, cl::sycl::interop_handler& ih);

~CublasScopedContextHandler() noexcept(false);
/**
@@ -87,7 +88,7 @@ class CublasScopedContextHandler {
// This is a work-around function for reinterpret_casting the memory. This
// will be fixed when SYCL-2020 has been implemented for Pi backend.
template <typename T, typename U>
-inline T get_mem(cl::sycl::interop_handler ih, U acc) {
+inline T get_mem(U acc) {
CUdeviceptr cudaPtr = ih.get_mem<cl::sycl::backend::cuda>(acc);
return reinterpret_cast<T>(cudaPtr);
}
33 changes: 33 additions & 0 deletions src/blas/backends/cublas/cublas_task.hpp
@@ -0,0 +1,33 @@
#ifndef _MKL_BLAS_CUBLAS_TASK_HPP_
#define _MKL_BLAS_CUBLAS_TASK_HPP_
#include <cublas_v2.h>
#include <cuda.h>
#include <complex>
#include <CL/sycl.hpp>
#include "oneapi/mkl/types.hpp"
#include "cublas_scope_handle.hpp"
#include <CL/sycl/detail/pi.hpp>

namespace oneapi {
namespace mkl {
namespace blas {
namespace cublas {

template <typename H, typename F>
static inline auto host_task_internal(H &cgh, cl::sycl::queue queue, F f) -> decltype(cgh.interop_task(f)) {

The decltype here looks fishy. First of all, you don't return anything; did you mean to write return cgh.interop_task([f, queue]..)?
Secondly, decltype(cgh.interop_task(f)) shouldn't work at all, as interop_task would not be able to instantiate f if it doesn't take an interop_handler as its parameter, right?

So either replace the auto and decltype with void (as you are discarding the result in onemkl_cublas_host_task anyway), or use something like decltype(cgh.interop_task([](cl::sycl::interop_handler ih) {})) (you could also change the parameter type to auto, I think).

Author

@sbalint98 sbalint98 May 21, 2021

You're totally right, this part of the code is quite confusing. I followed the pattern that has been used for the CPU functions, see: https://github.com/oneapi-src/oneMKL/blob/e8e3dabf9fbda0556b8075c76b657336f88440f0/src/blas/backends/mklcpu/mklcpu_common.hpp#L42-L56

Probably would be nicer to do something like:

template <typename H, typename F, decltype(cgh.interop_task(f))
.... void host_task_internal...

I intend to add the functionality for hipSYCL like this:
sbalint98#2

What does it even return? I think none of the handler member functions that are there to be called inside queue::submit have a non-void return type in SYCL. I imagine this is also the case for interop_task - so why can the return type not simply be void?

It's basically SFINAE on whether the function exists or not.
For CPU they already use something similar in oneMKL, but there the interface of f fits the requirements of the task function, which is not the case here (f takes the scoped handler instead of an interop handler, and you can't wrap it in a lambda either, as it's in an unevaluated context).
It probably works as long as the checked functions don't use SFINAE to check the signature of the parameter.
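
The detection idea discussed in this thread can be sketched without SYCL. In the example below every name (`ScopedMock`, `HandlerWithInterop`, `HandlerWithoutInterop`, `task_internal`, `submit_task`) is hypothetical and introduced only for illustration; it is not the oneMKL code. It assumes the checked member function is an unconstrained template, which is why `decltype(cgh.interop_task(f))` is accepted even though `f` takes the scoped object rather than the interop handler. The return type is forced to `void` with the `decltype(expr, void())` idiom, and an `int`/`long` dummy parameter ranks the two overloads.

```cpp
#include <cassert>

// Hypothetical stand-in for the scoped context handler; it only records the
// value it was built from.
struct ScopedMock {
    int ih;
};

// Hypothetical command-group handlers: one offers interop_task, one does not.
struct HandlerWithInterop {
    template <typename F>
    void interop_task(F f) {
        f(42); // 42 plays the role of the interop handler
    }
};
struct HandlerWithoutInterop {};

// Preferred overload: viable only if cgh.interop_task(f) is a well-formed
// expression. The return type is void regardless of what interop_task returns.
template <typename H, typename F>
auto task_internal(H &cgh, F f, int) -> decltype(cgh.interop_task(f), void()) {
    cgh.interop_task([f](int ih) {
        ScopedMock sc{ ih };
        f(sc);
    });
}

// Fallback overload: chosen when the expression above is ill-formed. When both
// overloads are viable, the literal 0 matches int exactly and long loses.
template <typename H, typename F>
void task_internal(H &, F f, long) {
    ScopedMock sc{ -1 };
    f(sc);
}

// Single entry point; callers never pass the dummy ranking argument.
template <typename H, typename F>
void submit_task(H &cgh, F f) {
    task_internal(cgh, f, 0);
}
```

With this shape the function body really does call `interop_task` with a wrapping lambda, while the `decltype` only asks whether the member exists. If `interop_task` constrained its parameter's signature, the check with the unwrapped `f` would fail and the fallback would be selected instead.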

cgh.interop_task([f, queue](cl::sycl::interop_handler ih){
auto sc = CublasScopedContextHandler(queue, ih);
f(sc);
});
}

template <typename H, typename F>
static inline void onemkl_cublas_host_task(H &cgh, cl::sycl::queue queue, F f) {

So you will be ifdef-ing host_task_internal depending on the implementation?

Author

This is what I had in mind, yes.
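
A rough sketch of what such a preprocessor switch could look like, with the SYCL parts replaced by placeholders so that it compiles anywhere. `ScopedMock`, `host_task_internal_sketch` and `host_task_sketch` are hypothetical names, and the use of `__HIPSYCL__` as the detection macro is an assumption that should be checked against the hipSYCL documentation.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the scoped context handler.
struct ScopedMock {
    std::string origin;
};

// ASSUMPTION: __HIPSYCL__ is used here as the macro that identifies a hipSYCL
// build; check the compiler documentation before relying on it.
#if defined(__HIPSYCL__)
template <typename F>
inline void host_task_internal_sketch(F f) {
    // Placeholder for a hipSYCL-specific interop path.
    ScopedMock sc{ "hipSYCL" };
    f(sc);
}
#else
template <typename F>
inline void host_task_internal_sketch(F f) {
    // Placeholder for the interop_task path shown in this diff.
    ScopedMock sc{ "interop_task" };
    f(sc);
}
#endif

// Callers use one name and never see which branch was compiled.
template <typename F>
inline void host_task_sketch(F f) {
    host_task_internal_sketch(f);
}
```

Keeping the switch inside one header means the BLAS source files only ever call the common wrapper, which matches the intent stated in the replies above.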

(void)host_task_internal(cgh, queue, f);
}

} // namespace cublas
} // namespace blas
} // namespace mkl
} // namespace oneapi
#endif // _MKL_BLAS_CUBLAS_TASK_HPP_