Conversation

@RossBrunton
Contributor

This updates the spec to provide a way for async errors to be signaled
from, for example, kernels. The error is stored on the queue and can be
queried with `olGetQueueError`. In addition, if any other queues are
waiting on the error'd queue they will also enter the error state.

With this design, both `olSyncEvent` and `olSyncQueue` will now exit
early on error. More specifically, unless a kernel gets trapped in an
infinite loop, both sync functions will always return in a finite amount
of time.
Contributor

@jhuber6 jhuber6 left a comment


Do you have an example of what this would look like in practice?

@RossBrunton
Contributor Author

@jhuber6 From the end user perspective, I'm imagining something like this:

```cpp
ol_queue_handle_t Queue;
olCreateQueue(Device, &Queue);
// Enqueue a lot of work to the queue
auto Err = olSyncQueue(Queue);
if (Err && Err->Code == OL_ERRC_QUEUE_ERROR) {
  // Fetch the stored async error from the queue
  olGetQueueError(Queue, &Err, nullptr);
  std::cerr << "Error: " << Err->Desc << "\n";
}
```

I'm still not 100% sure how to implement it on AMD/Nvidia. I was going to look into it more if the design looked good.

@RossBrunton
Copy link
Contributor Author

Going to close this, as I don't see myself working on this any time soon. I was originally going to work on implementing an example in the AMDGPU rtl, but never got around to starting it. I'll detail what I was planning on doing, although I can't say for sure this can actually be implemented.

Basically, every queue has an "error" signal. When a task completes, it either decrements the "success" signal (which is also the input trigger for the next task) or sets the "error" signal (of which only one exists per queue). `olSyncQueue` and friends wait on both the "error" signal and the "success" signal from the final task, and use whichever of the two fired to determine whether an error occurred.

This means that a task failing effectively "skips the queue" and causes the entire pipeline to stop, but also allows `olSyncQueue` to actually terminate rather than hang.

The weird dependency system between queues means that this failure signal must also be sent to any "wait"-ing queues if the queue they depend on encounters an error. I'm not 100% sure how to implement that part.
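For what it's worth, the two-signal scheme above can be sketched on the host with a mutex and condition variable. This is a minimal model of the idea, not the AMDGPU implementation; `QueueState`, `taskDone`, and `sync` are made-up names for illustration:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical host-side model of the per-queue signals described above.
struct QueueState {
  std::mutex M;
  std::condition_variable CV;
  int PendingTasks = 0; // "success" side: counts down as tasks complete
  bool Error = false;   // "error" side: a single flag per queue

  // Called by a task when it finishes, successfully or not.
  void taskDone(bool Failed) {
    std::lock_guard<std::mutex> Lock(M);
    if (Failed)
      Error = true; // a failure "skips the queue": later tasks never fire
    else
      --PendingTasks;
    CV.notify_all();
  }

  // Waits on both signals; returns false if the queue entered the error
  // state, so the caller terminates in finite time instead of hanging.
  bool sync() {
    std::unique_lock<std::mutex> Lock(M);
    CV.wait(Lock, [&] { return Error || PendingTasks == 0; });
    return !Error;
  }
};
```

Here `sync()` returning false models `olSyncQueue` reporting `OL_ERRC_QUEUE_ERROR` even though some "success" decrements never arrived.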

@RossBrunton RossBrunton closed this Oct 1, 2025