Skip to content

Conversation

@steffenlarsen
Copy link
Contributor

@steffenlarsen steffenlarsen commented Dec 2, 2024

This commit fixes an issue where memory operations enqueued through the enqueue free functions would not correctly mark the resulting events as discarded, breaking in-order barrier assumptions.

Fixes #15606.

Co-authored-by: Andrey Alekseenko [email protected]

This commit fixes an issue where memory operations enqueued through
the enqueue free functions would not correctly mark the resulting events
as discarded, breaking in-order barrier assumptions.

Signed-off-by: Larsen, Steffen <[email protected]>
@al42and
Copy link
Contributor

al42and commented Dec 2, 2024

Can confirm, this fixes #15606 as reported.

But on the application level, I now get "wait method cannot be used for a discarded event" exception from queue::wait_and_throw() in tests with multiple threads that I don't get without this PR. I can look deeper into it, but perhaps it's immediately obvious for you what goes wrong here?

Example trace: 16223.log (the "MPI rank" in the error message refers to a thread, we're using our thread-based MPI imitation).

@steffenlarsen steffenlarsen marked this pull request as draft December 3, 2024 06:41
@steffenlarsen
Copy link
Contributor Author

steffenlarsen commented Dec 3, 2024

Can confirm, this fixes #15606 as reported.

But on the application level, I now get "wait method cannot be used for a discarded event" exception from queue::wait_and_throw() in tests with multiple threads that I don't get without this PR. I can look deeper into it, but perhaps it's immediately obvious for you what goes wrong here?

Example trace: 16223.log (the "MPI rank" in the error message refers to a thread, we're using our thread-based MPI imitation).

Thank you, @al42and! I will convert this to draft while I investigate. It looks like the problem comes after a mem-fill operation, so there must be some dependencies that don't know how to handle the case where they are discarded.

Update: I have not yet been able to reproduce the failure. @al42and - Would you be able to provide a stack trace from the throw-site? Hope is it can tell me what kind of dependencies are causing this.

@al42and
Copy link
Contributor

al42and commented Dec 3, 2024

Would you be able to provide a stack trace from the throw-site? Hope is it can tell me what kind of dependencies are causing this.

(gdb) bt
#0  0x00007fffe6aae4a1 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007fffe3881b9b in sycl::_V1::detail::event_impl::wait(std::shared_ptr<sycl::_V1::detail::event_impl>, bool*) [clone .cold] () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#2  0x00007fffe3a8cd29 in sycl::_V1::detail::queue_impl::wait(sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#3  0x00007fffe3b2ec29 in sycl::_V1::queue::wait_and_throw_proxy(sycl::_V1::detail::code_location const&) () from /home/aland/intel-sycl/llvm/build/install//lib/libsycl.so.8
#4  0x00007fffeb3bb4b5 in sycl::_V1::queue::wait_and_throw (this=0x7fffffff5c98, CodeLoc=...) at /home/aland/intel-sycl/llvm/build/install/include/sycl/queue.hpp:431
#5  0x00007fffeb3ba0f0 in DeviceStream::synchronize (this=0x374ed40) at /home/aland/gromacs/src/gromacs/gpu_utils/device_stream_sycl.cpp:162

Since I'm lazy to rebuild LLVM in debug mode, printf-based tracing suggest that:

Running with SYCL_UR_TRACE indeed suggests that there's a urEnqueueUSMFill call nearby, but it's hard to disentangle the output from two threads, and the problem does not reproduce with real MPI with separate processes or when running threads sequentially in the debugger. I'm looking at making a smaller reproducer, but no luck so far.

@al42and
Copy link
Contributor

al42and commented Dec 3, 2024

#include <iostream>
#include <sycl/sycl.hpp>
#include <thread>
#include <unistd.h>

static constexpr int nthreads = 2;
static constexpr int niter = 20;

void threadFunction(int tid) {
  sycl::device dev(sycl::gpu_selector_v);
  std::cout << dev.get_info<sycl::info::device::name>() << std::endl;

  sycl::queue q{dev, {sycl::property::queue::in_order()}};

  constexpr int size = 128 * 128 * 128;
  int *d_buf = sycl::malloc_device<int>(size, q);
  int *h_buf = sycl::malloc_host<int>(size, q);

  const sycl::nd_range<1> range1D{{size}, {128}};

  std::vector<sycl::event> evs;
  for (int i = 0; i < niter; i++) {
    evs.push_back(q.memcpy(h_buf, d_buf, size * sizeof(int)));
    sycl::ext::oneapi::experimental::submit(
        q, [&](sycl::handler &cgh) { cgh.fill<int>(d_buf, 1, size); });
  }

  q.wait_and_throw();
  std::cout << "After waiting for the queue" << std::endl;
  std::cout << h_buf[0] << std::endl;

  sycl::free(d_buf, q);
  sycl::free(h_buf, q);
}

int main() {
  std::array<std::thread, nthreads> threads;

  for (int i = 0; i < nthreads; i++) {
    threads[i] = std::thread{threadFunction, i};
  }

  for (int i = 0; i < nthreads; i++) {
    threads[i].join();
  }
  std::cout << "All threads have finished." << std::endl;

  return 0;
}
$ clang++ -fsycl 16223.cpp && ONEAPI_DEVICE_SELECTOR=opencl:gpu ./a.out 
Intel(R) Arc(TM) A770 Graphics
Intel(R) Arc(TM) A770 Graphics
terminate called after throwing an instance of 'sycl::_V1::exception'
  what():  wait method cannot be used for a discarded event.
Aborted (core dumped)

Looks like it's not a specific operation, but some smart pointer lifetime issue? Also, this feels like a nasty unrelated issue of pointer re-use that only now is surfacing, but I could easily be wrong here.

@steffenlarsen
Copy link
Contributor Author

Thanks a ton, @al42and ! I believe the problematic cases have now been addressed and I have adapted your code into a smaller regression test. I could not find your signature on previous commit, so please let me know if you would like to be added as co-author.

@steffenlarsen steffenlarsen marked this pull request as ready for review December 4, 2024 07:57
@al42and
Copy link
Contributor

al42and commented Dec 4, 2024

I believe the problematic cases have now been addressed and I have adapted your code into a smaller regression test.

Thank you! Can confirm that it all works now.

I could not find your signature on previous commit, so please let me know if you would like to be added as co-author.

Andrey Alekseenko <[email protected]>

Copy link
Contributor

@cperkinsintel cperkinsintel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.


if (Event)
MEvent->setHandle(*Event);
SetEventHandleOrDiscard();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

<3 x N

@steffenlarsen steffenlarsen merged commit f26c0b8 into intel:sycl Dec 9, 2024
14 checks passed
@al42and
Copy link
Contributor

al42and commented Dec 11, 2024

@steffenlarsen: Thanks again for the fix! Do you know if this is going into the next patch release of oneAPI 2025.0 (productized version, not open-source one), or will it wait for oneAPI 2025.1? Need to know which versions to warn about :)

@steffenlarsen
Copy link
Contributor Author

@steffenlarsen: Thanks again for the fix! Do you know if this is going into the next patch release of oneAPI 2025.0 (productized version, not open-source one), or will it wait for oneAPI 2025.1? Need to know which versions to warn about :)

Sadly it won't make 2025.0, but I will do what I can to make sure it gets into the following minor release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

SYCL_EXT_ONEAPI_ENQUEUE_BARRIER not working with SYCL_EXT_ONEAPI_ENQUEUE_FUNCTIONS

3 participants