[CI][Bench] Double-check that the device being used is the device we want #20383

ianayl · 2025-10-16T20:03:39Z

There are concerns that there may be multiple GPUs available on a performance runner, i.e. incase an integrated GPU accidentally gets picked. This PR introduces a script to parse for the correct ONEAPI_DEVICE_SELECTOR value to use.

Test runs:

uditagarwal97 · 2025-10-16T20:49:56Z

Another option could be to use SYCL_DEVICE_ALLOWLIST to filter devices based on device name.

lslusarczyk · 2025-10-17T08:17:54Z

devops/actions/run-tests/benchmark/action.yml

+      echo "Searching for a $MACHINE_TYPE device..."
+      explicit_device_sel=""
+
+      sycl_ls_out="$(mktemp)"


I am against adding this code in this way.
It is complex, not automatically tested and once we commit it we will be afraid of touching this code worrying of not breaking it.

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

If we really need a code that automatically creates ONEAPI_DEVICE_SELECTOR out of MACHINE_TYPE and sycl-ls output then please let's create a testable code.

Below proposition in python. You may do similar in bash if you know how to nicely test it. I don't know, thus proposition in python.

create a script get_oneapi_device_selector.py taking as argument MACHINE_TYPE

inside create two functions:

function get_oneapi_device_selector( sycl_ls_outpout : string, machine_type : string) returning string, and...

function main executing sycl-ls and passing it to get_oneapi_device_selector

write python doctests for get_oneapi_device_selector function. It will be a nice test as it takes strings and outputs string. They will also well document what this selector we expect to be having given input. Two cases will be enough - one with actual BMG sycl-ls output and one with PVC output. Of course we may add more (one card, two cards, no cards). Main function may be left untested - it just calls get_oneapi_device_selector function.

make doctests that you created beging executed in our CI

This is (a bit) heavy, but there is too much technical debt in benchmarking scripts related to skipping unittests and I'd like to propose we do not produce this debt more. Once done this way we will have doctests working, which would be nice.

Again, I think just hardcoding selectors is good too (or even better). If you think a script is really a better solution, but we don't have time to write it in testable way now, then maybe hardcode selectors for now and leave TODO with pointer to this discussion and to your current commit, which we plan to turn into tested script in the future.

update: The easiest, (ugly, but acceptable because tested by doctests) way of writing get_oneapi_device_selector.py would be calling your bash code within get_oneapi_device_selector function. If in hurry, we may consider this way and refactor in the future. Refactor will be a pleasure, as we will already have tests :)

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

I avoided hardcoding because:

In case numbering/output of sycl-ls changes (i.e., there are discussions right now for changes potentially changing numbering in sycl-ls), then hardcoding values could result in the wrong GPU being used

If a single device exists per backend, sycl-ls defaults to no device identifiers; one would just use e.g. level_zero:gpu. In the case that a runner is changed, such a value could also potentially result in the wrong GPU being used.

But, your concerns w.r.t. a lack of testing is entirely valid. A lack of testing outside of a precommit test-run of our code is not exactly reassuring. Given the slightly problematic nature of sycl-ls mentioned above and your objections, perhaps hardcoding to use SYCL_DEVICE_ALLOWLIST as Udit mentioned is a better and simpler solution instead.

Ah... the only way to ensure what device you are using with SYCL_DEVICE_ALLOWLIST is to rely on DeviceName, which is deprecated... If we were to go for the env vars path, I'd need to add this it ONEAPI_DEVICE_SELECTOR (and consult stakeholders, get approvals, etc.; unfortunately my internship is ending, and I won't have the time for this.)

The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR.

lslusarczyk · 2025-10-17T08:20:17Z

devops/actions/run-tests/benchmark/action.yml

        '["PVC_PERF"]') MACHINE_TYPE="PVC" ;;
        '["BMG_PERF"]') MACHINE_TYPE="BMG" ;;
        # Best effort at matching
        *)


By chance, since we touch near this code.
Could you please remove "best effort" and fail in unknown case? This is dead code, non PVC_PERF/BMG_PERF conditions should never happen and nobody knows what will happen if it will pass through.

lslusarczyk · 2025-10-17T08:49:34Z

devops/actions/run-tests/benchmark/action.yml

      # TODO: in terms of security, is this overkill?
      if [ -z "$(printf '%s' "$RUNNER_NAME" | grep -oE '^[a-zA-Z0-9_-]+$')" ]; then
          echo "Bad runner name, please ensure runner name is [a-zA-Z0-9_-]."
+          echo "**Error:** Bad runner name '$RUNNER_NAME', please ensure runner name is [a-zA-Z0-9_-]." >> $GITHUB_STEP_SUMMARY


Let's remove this if and fail right in above case default branch when RUNNER_TAG isn't PVC_PERF or BMG_PERF. Is there a case when RUNNER_TAG may be different than them?

ianayl added 4 commits October 2, 2025 14:32

Double check device being used is not an integrated GPU

7e81388

TEST: check output

12c6bee

TEST: check sycl-ls

55f69a5

fix bug with 1 device

93e2127

ianayl requested a review from a team as a code owner October 16, 2025 20:03

ianayl temporarily deployed to WindowsCILock October 16, 2025 20:03 — with GitHub Actions Inactive

uditagarwal97 requested a review from a team October 16, 2025 20:05

ianayl temporarily deployed to WindowsCILock October 16, 2025 20:40 — with GitHub Actions Inactive

lslusarczyk requested changes Oct 17, 2025

View reviewed changes

lslusarczyk reviewed Oct 17, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[CI][Bench] Double-check that the device being used is the device we want #20383

[CI][Bench] Double-check that the device being used is the device we want #20383

Uh oh!

ianayl commented Oct 16, 2025

Uh oh!

uditagarwal97 commented Oct 16, 2025

Uh oh!

lslusarczyk Oct 17, 2025 •

edited

Loading

Uh oh!

ianayl Oct 17, 2025

Uh oh!

ianayl Oct 17, 2025

Uh oh!

lslusarczyk Oct 17, 2025

Uh oh!

lslusarczyk Oct 17, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

[CI][Bench] Double-check that the device being used is the device we want #20383

Are you sure you want to change the base?

[CI][Bench] Double-check that the device being used is the device we want #20383

Uh oh!

Conversation

ianayl commented Oct 16, 2025

Uh oh!

uditagarwal97 commented Oct 16, 2025

Uh oh!

lslusarczyk Oct 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ianayl Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

ianayl Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

lslusarczyk Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

lslusarczyk Oct 17, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lslusarczyk Oct 17, 2025 •

edited

Loading