
Conversation


@EuphoricThinking EuphoricThinking commented Aug 12, 2025

Adding a new feature: batched queue submissions.

Batched queues enable submission of operations to the driver in batches, thereby reducing the overhead of submitting every single operation individually. Similar to command buffers in L0v2, they use regular command lists (referred to below as 'batches'). Operations enqueued on a regular command list are not executed immediately, but only after the regular command list is enqueued on an immediate command list. In contrast to command buffers, however, batched queues do not merely collect enqueued operations: they also handle the submission of batches (regular command lists) themselves, using an internal immediate command list.
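
To illustrate, here is a minimal sketch of the underlying Level Zero flow, assuming the experimental zeCommandListImmediateAppendCommandListsExp entry point is what executes a closed regular command list on an immediate one; the helper name and setup are illustrative, not the actual adapter code:

#include <level_zero/ze_api.h>

// Minimal sketch; error handling omitted, all handles assumed valid
// and created elsewhere (context, device, internal immediate list, kernel).
void submitBatchSketch(ze_context_handle_t hContext,
                       ze_device_handle_t hDevice,
                       ze_command_list_handle_t hImmediate,
                       ze_kernel_handle_t hKernel,
                       const ze_group_count_t &groupCount) {
  // 1. Create a regular command list: the 'batch'.
  ze_command_list_desc_t desc = {};
  desc.stype = ZE_STRUCTURE_TYPE_COMMAND_LIST_DESC;
  ze_command_list_handle_t hBatch = nullptr;
  zeCommandListCreate(hContext, hDevice, &desc, &hBatch);

  // 2. Enqueue user operations on the batch; nothing executes yet.
  zeCommandListAppendLaunchKernel(hBatch, hKernel, &groupCount,
                                  /*hSignalEvent=*/nullptr,
                                  /*numWaitEvents=*/0, nullptr);

  // 3. Close the batch and hand it to the internal immediate command
  //    list; only now do the batched operations start executing.
  zeCommandListClose(hBatch);
  zeCommandListImmediateAppendCommandListsExp(hImmediate, 1, &hBatch,
                                              /*hSignalEvent=*/nullptr,
                                              /*numWaitEvents=*/0, nullptr);
}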

Batched queues introduce:

  • batch_manager, which stores the current batch, a command list manager with an immediate command list for batch submissions, a vector of submitted batches, and the generation number of the current batch.
  • The current batch is a command list manager with a regular command list; operations requested by users are enqueued on the current batch. The current batch may be submitted for execution on the immediate command list, after which it is replaced by a new regular command list and stored in the vector of submitted batches until its execution completes.
  • The number of regular command lists stored for execution is limited.
  • The generation number of the current batch is assigned to events associated with operations enqueued on that batch, and it is incremented on every replacement of the current batch. When an event created by a batched queue appears in an eventWaitList, the batch associated with that event might not have been executed yet, in which case the event would never be signalled. Comparing generation numbers determines whether the current batch must be submitted for execution: if the generation number of the current batch is higher than the number assigned to the given event, the batch associated with the event has already been submitted, and no additional submission of the current batch is needed (see the sketch after this list).
  • Regular command lists use the regular pool cache type, whereas immediate command lists use the immediate pool cache type. Since user-requested operations are enqueued on regular command lists and immediate command lists are used only internally by the batched queue implementation, events are not created for immediate command lists.
  • wait_list_view is modified. Previously, it only stored the waitlist (as a ze_event_handle buffer created from events) and the corresponding event count in a single container that could be passed as an argument to the driver API. Now the constructor also ensures that all associated operations will eventually be executed: since regular command lists are not executed immediately, but only after being enqueued on immediate lists, the regular command list associated with a given event must be submitted; otherwise, the event would never be signalled.
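
To make the generation-number bookkeeping concrete, here is a hypothetical, heavily simplified sketch; the real batch_manager also owns the command list managers and the vector of submitted batches, and all names here are illustrative:

#include <cstdint>

// Illustrative stand-in for an event created by a batched queue.
struct event_sketch {
  uint64_t batchGeneration; // generation of the batch this event belongs to
};

struct batch_manager_sketch {
  uint64_t currentGeneration = 0;

  // Called whenever the current batch is submitted on the immediate
  // command list and replaced by a fresh regular command list.
  void onCurrentBatchReplaced() { ++currentGeneration; }

  // An event in an eventWaitList forces a submission only if its batch
  // is still the current, not-yet-submitted one. If currentGeneration
  // is already higher, the event's batch was submitted earlier and the
  // event will eventually be signalled without further action.
  bool mustSubmitCurrentBatch(const event_sketch &e) const {
    return e.batchGeneration == currentGeneration;
  }
};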

Additionally, support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo has been added for Native CPU, since the enqueueTimestampRecording tests require it. Currently, enqueueTimestampRecording is not supported by batched queues.
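
A query through this path might look roughly like the following sketch, where hQueue is assumed to be a queue created earlier:

#include <ur_api.h>

// Sketch: read back the flags the queue was created with.
ur_queue_flags_t flags{};
urQueueGetInfo(hQueue, UR_QUEUE_INFO_FLAGS, sizeof(flags), &flags,
               /*pPropSizeRet=*/nullptr);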

Batched queues can be enabled per queue, by setting UR_QUEUE_FLAG_SUBMISSION_BATCHED in ur_queue_flags_t, or globally, through the environment variable UR_L0_FORCE_BATCHED=1.
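
For example, opting in per queue could look like the following sketch; hContext and hDevice are assumed to come from the usual UR adapter/device setup:

#include <ur_api.h>

// Request batched submissions at queue creation time.
ur_queue_properties_t props{};
props.stype = UR_STRUCTURE_TYPE_QUEUE_PROPERTIES;
props.pNext = nullptr;
props.flags = UR_QUEUE_FLAG_SUBMISSION_BATCHED;

ur_queue_handle_t hQueue = nullptr;
urQueueCreate(hContext, hDevice, &props, &hQueue);

Alternatively, exporting UR_L0_FORCE_BATCHED=1 enables batching for all queues without any code changes.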

Benchmark results for default in-order queues (sycl branch, commit hash: b76f12e) and batched queues:
api_overhead_benchmark_ur SubmitKernel in order: 20.839 μs
api_overhead_benchmark_ur SubmitKernel batched:  12.183 μs

For CI testing, batched queues are enabled by default, which may cause failures in tests dedicated to other queue types, e.g. in-order. This default will be reverted once CI testing is complete.


@pbalcer pbalcer left a comment

Have you been able to run the SubmitKernel benchmarks? If so, can you please share results?

@EuphoricThinking EuphoricThinking marked this pull request as ready for review October 16, 2025 11:31
@EuphoricThinking EuphoricThinking requested review from a team as code owners October 16, 2025 11:31

namespace v2 {

struct batch_manager {
Contributor

You use three different styles of multi-line comments in this struct. I think the most commonly used style in the adapter codebase is:

//
//
//

But if you want to use block-style comments, do:

/*
 * ...
 * ...
 */

Contributor Author

I have unified the format in this file for multi-line comments, although it still differs slightly from your recommendation (this is the version from the formatter)

Contributor

Can you do the same for all the comments in the patch?

> this is the version from the formatter

AFAIK clang-format's LLVM multi-line comment style is this:

/*
 * first line
 * second line
 */

It'd be odd if it changed it to something else.

Comment on lines +36 to +40
/*
Support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo is required by the
enqueueTimestampRecording tests after introducing batched queues, since
batched queues do not support enqueueTimestampRecording.
*/
Contributor

Suggested change
/*
Support for UR_QUEUE_INFO_FLAGS in urQueueGetInfo is required by the
enqueueTimestampRecording tests after introducing batched queues, since
batched queues do not support enqueueTimestampRecording.
*/

This sounds more like a commit message (context for why a change is made) rather than a comment (why a piece of code does something).

Contributor Author

Is it better to move this change to another commit, remove this comment or do you mean something else?

Contributor

I'd just remove the comment. Ideally this would be a separate commit.

