Conversation

@RossBrunton
Contributor

This updates the spec to provide a way for async errors to be signaled
from, for example, kernels. The error is stored on the queue and can be
queried with `olGetQueueError`. In addition, if any other queues are
waiting on the error'd queue they will also enter the error state.

With this design, both `olSyncEvent` and `olSyncQueue` will now exit
early on error. More specifically, unless a kernel gets trapped in an
infinite loop, both sync functions will always return in a finite amount
of time.
Contributor

@jhuber6 jhuber6 left a comment


Do you have an example of what this would look like in practice?

@RossBrunton
Contributor Author

@jhuber6 From the end user perspective, I'm imagining something like this:

```cpp
ol_queue_handle_t Queue;
olCreateQueue(Device, &Queue);
// Enqueue a lot of work to the queue
auto Err = olSyncQueue(Queue);
if (Err && Err->Code == OL_ERRC_QUEUE_ERROR) {
  // Fetch the stored async error from the queue
  olGetQueueError(Queue, &Err, nullptr);
  std::cerr << "Error: " << Err->Desc << "\n";
}
```

I'm still not 100% sure how to implement it on AMD/Nvidia. I was going to look into it more if the design looked good.

@RossBrunton
Copy link
Contributor Author

Going to close this, as I don't see myself working on this any time soon. I was originally going to work on implementing an example in the AMDGPU rtl, but never got around to starting it. I'll detail what I was planning on doing, although I can't say for sure this can actually be implemented.

Basically, every queue has an "error" signal. When a task completes, it either decrements the "success" signal (which is also the input trigger for the next task) or sets the "error" signal (of which only one exists per queue). `olSyncQueue` and friends wait on both the "error" signal and the "success" signal from the final task, and use whichever of the two fired to determine whether an error occurred.

This means that a task failing effectively "skips the queue" and causes the entire pipeline to stop, but also allows `olSyncQueue` to actually terminate rather than hang.

The weird dependency system between queues means that this failure signal must also be sent to any "wait"-ing queues if the queue they depend on encounters an error. I'm not 100% sure how to implement that part.
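For what it's worth, the two-signal scheme above can be sketched on the host with a mutex and condition variable. This is a minimal model of the idea, not the AMDGPU implementation; `QueueState`, `taskDone`, and `sync` are made-up names for illustration:

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical host-side model of the per-queue signals described above.
struct QueueState {
  std::mutex M;
  std::condition_variable CV;
  int PendingTasks = 0; // "success" side: counts down as tasks complete
  bool Error = false;   // "error" side: a single flag per queue

  // Called by a task when it finishes, successfully or not.
  void taskDone(bool Failed) {
    std::lock_guard<std::mutex> Lock(M);
    if (Failed)
      Error = true; // a failure "skips the queue": later tasks never fire
    else
      --PendingTasks;
    CV.notify_all();
  }

  // Waits on both signals; returns false if the queue entered the error
  // state, so the caller terminates in finite time instead of hanging.
  bool sync() {
    std::unique_lock<std::mutex> Lock(M);
    CV.wait(Lock, [&] { return Error || PendingTasks == 0; });
    return !Error;
  }
};
```

Here `sync()` returning false models `olSyncQueue` reporting `OL_ERRC_QUEUE_ERROR` even though some "success" decrements never arrived.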

@RossBrunton RossBrunton closed this Oct 1, 2025