Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
192 changes: 192 additions & 0 deletions sycl/doc/extensions/proposed/sycl_ext_oneapi_clock.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,192 @@
= sycl_ext_oneapi_clock

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]
:endnote: —{nbsp}end{nbsp}note

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}


== Notice

[%hardbreaks]
Copyright (C) 2025 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.


== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues


== Dependencies

This extension is written against the SYCL 2020 revision 10 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*

== Backend support status

The APIs in this extension may be used only on a device that has
`aspect::ext_oneapi_clock_sub_group`, `aspect::ext_oneapi_clock_work_group` or
`aspect::ext_oneapi_clock_device` accordingly. The application must check that
the device has these aspects before submitting a kernel using a corresponding
API in this extension. If the application fails to do this, the implementation
throws a synchronous exception with the `errc::kernel_not_supported` error code
when the kernel is submitted to the queue.

== Overview

This extension introduces a new free function `clock<clock_scope>()`. This
function allows the user to sample the value from one of three clocks provided
by the compute units, depending on the value of the scope argument. The clocks
in this extension do not necessarily count units of time. For example, they may
count cycles instead. In addition, the cycle frequency may change as the kernel
executes. As a result, there is no portable way to convert the values returned
by these clocks into time durations.

`scope` is an enumeration constant of the new `clock_scope` enum. It should be
passed to the function to define the clock source; e.g.,
`clock<clock_scope::sub_group>()` samples the value from a clock shared by all
work-items executing in the same sub-group.

This extension also adds new aspects: `ext_oneapi_clock_sub_group`,
`ext_oneapi_clock_work_group` and `ext_oneapi_clock_device` indicating whether
the device supports the corresponding clock scopes.

== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_CLOCK` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.

[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the feature-test
macro always has this value.
|===

=== New device aspects

This extension adds new device aspects:

```c++
namespace sycl {

enum class aspect : /*unspecified*/ {
ext_oneapi_clock_sub_group,
ext_oneapi_clock_work_group,
ext_oneapi_clock_device
};

} // namespace sycl
```

[width="100%",%header,cols="50%,50%"]
|===
|Aspect
|Description

|`ext_oneapi_clock_sub_group`
|Indicates that the device supports the `sycl::ext::oneapi::experimental::clock<clock_scope::sub_group>()` call.
|`ext_oneapi_clock_work_group`
|Indicates that the device supports the `sycl::ext::oneapi::experimental::clock<clock_scope::work_group>()` call.
|`ext_oneapi_clock_device`
|Indicates that the device supports the `sycl::ext::oneapi::experimental::clock<clock_scope::device>()` call.
|===

=== New enum

```c++
namespace sycl::ext::oneapi::experimental {

enum class clock_scope : /* unspecified */ {
sub_group,
work_group,
device
};

}; // namespace sycl::ext::oneapi::experimental
```
An enumerator from `clock_scope` passed as a template parameter to the `clock()`
function defines the clock source:

[width="100%",%header,cols="50%,50%"]
|===
|Enumerator
|Description

|`sub_group`
|`clock()` gets values shared by all work-items executing in the same sub-group.

|`work_group`
|`clock()` gets values shared by all work-items executing in the same work-group.

|`device`
|`clock()` gets values shared by all work-items executing on the device.
|===

=== New free function

```c++
namespace sycl::ext::oneapi::experimental {

template <clock_scope scope> uint64_t clock();

} // namespace sycl::ext::oneapi::experimental
```

This function may only be called from within a SYCL kernel function.

All work-items within the `scope` read from the same source clock. There is no
guarantee that two work-items get the same value.

_Returns:_ The sample value of a clock as seen by the work-item.
The clock is defined as an unbounded, unsigned integer counter that
monotonically increments over time. The rate at which the clock advances is not
guaranteed to be constant: it may vary over the lifetime of a work-item, differ
between separate executions of the program, and be affected by conditions
outside the control of the programmer. The value returned by this instruction
corresponds to the least significant bits of the clock counter at the time of
execution. Consequently, the sampled value may wrap around zero.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we provide a guarantee that the clock counter starts at zero when the kernel starts executing? If so, that might help with the wrap-around problem.

Copy link
Contributor

@al42and al42and Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My 2 cents:

  1. Hardware counter wrapping is a common theme on CPUs.
  2. The prospective usecase I have in mind requires comparing counters between two kernels; resetting the device-scope counter for each kernel would break that.
  3. 64-bit nanosecond counter would need a few centuries to overflow, so wrap-around is more of a theoretical possibility. But the spec seem to assume that the counters can have less bits. A 32-bit nanosecond counter will overflow after ~4 seconds, so an overflow can occur within a kernel's lifetime.

(the above is not based on any hardware details; just trying to make some estimates)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The spec does not provide enough guarantees to do what you want:

  • You seem to be assuming that the clocks start at zero when the process first starts (or maybe when the process submits its first kernel). You are also assuming that the clocks are not reset between kernel submissions. However, none of this is guaranteed by the spec.

  • You are asserting that the clocks count in units of nanoseconds, but this is also not guaranteed by the spec.

To be clear, I don't think SYCL can provide any of these guarantees. The SYCL spec layers on top of SPV_KHR_shader_clock, which provides very few guarantees.

@bashbaug do you have any insight here? The SPIR-V spec (and the OpenCL C cl_khr_kernel_clock extension) don't seem to provide enough guarantees to do any useful timing operations. Do vendors provide some additional guarantees in some separate specification? Do applications just make assumptions about the clock that are not guaranteed by the spec, and the code just happens to work?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be assuming that the clocks start at zero when the process first starts (or maybe when the process submits its first kernel).

I made an assumption of this kind when estimating the upper limit on the wrap-around time. I don't see how this assumption is required for computing time difference (provided wrap-around is handled, but, unlike clock reset, it can be handled).

For clock_scope::device, the spec say: "clock() gets values shared by all work-items executing on the device.". To me, it reads like the clocks are not reset for the lifetime of the device object and also that two kernels executing simultaneously use the same clock.

You are asserting that the clocks count in units of nanoseconds, but this is also not guaranteed by the spec.

Yes, I pointed this out earlier in the discussion (and the fact that the units are not even guaranteed to be uniform). For the wrap-around estimation, 1 tick = 1 nanosecond was an assumption indeed, but based on the conversion constant used in some PTI-GPU samples. For the practical use, there is an issue recorded in the spec below, so hopefully it will be addressed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clock_scope::device, the spec say: "clock() gets values shared by all work-items executing on the device.". To me, it reads like the clocks are not reset for the lifetime of the device object and also that two kernels executing simultaneously use the same clock.

Yes, good point.

So, is the only remaining open the one about the units of the counter?

Copy link
Contributor

@al42and al42and Sep 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, is the only remaining open the one about the units of the counter?

I think it can be split into two parts, where the second one can be done without the first one:

  1. Defining the counter units: whether the ticks can be converted into seconds (and how).

  2. Guaranteeing a "steady" clock: a weaker guarantee that the clock is at least steady (each tick represents a consistent, though unknown, duration). That would still allow, e.g., benchmarking and load balancing. See, e.g., GL_EXT_shader_realtime_clock.

EDIT: From the application perspective, having the capability in the API is nice even if some hardware don't support it. At the very least, CPUs can support steady counters with known conversion factors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, this comment is resolved. I've been convinced that the API is useful even though the values it returns cannot be converted to time units. I have not clicked the "Resolve conversation" button only because @al42and has made some comments here too.

Addressing your comments ... I think we cannot provide the behavior you request on GPU because the hardware just doesn't work that way. We can count cycles (not time), and the frequency may change dynamically as the kernel executes. I agree that we could do better on CPU, but I think the motivation for adding this extension is to support GPU, and I do not think we want to invest the effort right now to implement the feature just for CPU.

Copy link
Contributor

@al42and al42and Sep 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right about the current Intel GPU hardware limitations. My concern, however, is designing a future-proof API rather than one constrained by the initial target's lowest common denominator.

I believe the API, even an experimental one, should reflect the capabilities of the entire DPC++ ecosystem. We shouldn't permanently limit the API due to the constraints of the first backend to support it, when other backends have broader capabilities.

A more flexible API doesn't require implementing it for every backend immediately. It simply prevents breaking changes later when (if) we do add support for more capable hardware. Many SYCL extensions are rolled out incrementally across backends.

To be clear, this is about the API design, not a demand for an immediate implementation on all platforms :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's much easier to loosen constraints in the future than it is to add new constraints. If we have new GPU devices in the future that can provide more guarantees about the clock, then we can update this extension to a new revision and add a new aspect that provides those additional guarantees. Therefore, I don't think the extension's current wording will tie our hands in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's easier to loosen constraints than to add them. However, I see this as less about adding constraints and more about designing an API that accurately communicates diverse hardware capabilities from the start.

Deferring this decision creates unnecessary friction for developers:

  • Preventing developers from using steady clocks, even when the underlying hardware supports them, will slow down the extension adoption.

  • Experimental extensions are unversioned, so adding a new symbol (e.g., ext_oneapi_clock_device_steady aspect) forces developers to write complex code with build-system checks and #ifdefs to manage API differences.

A more flexible API from the beginning would avoid these issues and allow developers to fully use all the hardware supported by DPC++. And this extension already seems to accommodate functionality beyond what is supported by off-the-shelf Intel GPUs (#20131).


== Issues

. How to convert the result of the function to seconds?
+
*RESOLVED*: There is no portable way to convert the values returned by these
clocks.