Skip to content

Conversation

ianayl
Copy link
Contributor

@ianayl ianayl commented Oct 16, 2025

There are concerns that there may be multiple GPUs available on a performance runner, i.e. incase an integrated GPU accidentally gets picked. This PR introduces a script to parse for the correct ONEAPI_DEVICE_SELECTOR value to use.

Test runs:

@ianayl ianayl requested a review from a team as a code owner October 16, 2025 20:03
@uditagarwal97 uditagarwal97 requested a review from a team October 16, 2025 20:05
@uditagarwal97
Copy link
Contributor

Another option could be to use SYCL_DEVICE_ALLOWLIST to filter devices based on device name.

echo "Searching for a $MACHINE_TYPE device..."
explicit_device_sel=""
sycl_ls_out="$(mktemp)"
Copy link
Contributor

@lslusarczyk lslusarczyk Oct 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am against adding this code in this way.
It is complex, not automatically tested and once we commit it we will be afraid of touching this code worrying of not breaking it.

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

If we really need a code that automatically creates ONEAPI_DEVICE_SELECTOR out of MACHINE_TYPE and sycl-ls output then please let's create a testable code.

Below proposition in python. You may do similar in bash if you know how to nicely test it. I don't know, thus proposition in python.

  1. create a script get_oneapi_device_selector.py taking as argument MACHINE_TYPE
  2. inside create two functions:
  3. function get_oneapi_device_selector( sycl_ls_outpout : string, machine_type : string) returning string, and...
  4. function main executing sycl-ls and passing it to get_oneapi_device_selector
  5. write python doctests for get_oneapi_device_selector function. It will be a nice test as it takes strings and outputs string. They will also well document what this selector we expect to be having given input. Two cases will be enough - one with actual BMG sycl-ls output and one with PVC output. Of course we may add more (one card, two cards, no cards). Main function may be left untested - it just calls get_oneapi_device_selector function.
  6. make doctests that you created beging executed in our CI

This is (a bit) heavy, but there is too much technical debt in benchmarking scripts related to skipping unittests and I'd like to propose we do not produce this debt more. Once done this way we will have doctests working, which would be nice.

Again, I think just hardcoding selectors is good too (or even better). If you think a script is really a better solution, but we don't have time to write it in testable way now, then maybe hardcode selectors for now and leave TODO with pointer to this discussion and to your current commit, which we plan to turn into tested script in the future.

update: The easiest, (ugly, but acceptable because tested by doctests) way of writing get_oneapi_device_selector.py would be calling your bash code within get_oneapi_device_selector function. If in hurry, we may consider this way and refactor in the future. Refactor will be a pleasure, as we will already have tests :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.

I avoided hardcoding because:

  • In case numbering/output of sycl-ls changes (i.e., there are discussions right now for changes potentially changing numbering in sycl-ls), then hardcoding values could result in the wrong GPU being used
  • If a single device exists per backend, sycl-ls defaults to no device identifiers; one would just use e.g. level_zero:gpu. In the case that a runner is changed, such a value could also potentially result in the wrong GPU being used.

But, your concerns w.r.t. a lack of testing is entirely valid. A lack of testing outside of a precommit test-run of our code is not exactly reassuring. Given the slightly problematic nature of sycl-ls mentioned above and your objections, perhaps hardcoding to use SYCL_DEVICE_ALLOWLIST as Udit mentioned is a better and simpler solution instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah... the only way to ensure what device you are using with SYCL_DEVICE_ALLOWLIST is to rely on DeviceName, which is deprecated... If we were to go for the env vars path, I'd need to add this it ONEAPI_DEVICE_SELECTOR (and consult stakeholders, get approvals, etc.; unfortunately my internship is ending, and I won't have the time for this.)

The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR.

'["PVC_PERF"]') MACHINE_TYPE="PVC" ;;
'["BMG_PERF"]') MACHINE_TYPE="BMG" ;;
# Best effort at matching
*)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By chance, since we touch near this code.
Could you please remove "best effort" and fail in unknown case? This is dead code, non PVC_PERF/BMG_PERF conditions should never happen and nobody knows what will happen if it will pass through.

# TODO: in terms of security, is this overkill?
if [ -z "$(printf '%s' "$RUNNER_NAME" | grep -oE '^[a-zA-Z0-9_-]+$')" ]; then
echo "Bad runner name, please ensure runner name is [a-zA-Z0-9_-]."
echo "**Error:** Bad runner name '$RUNNER_NAME', please ensure runner name is [a-zA-Z0-9_-]." >> $GITHUB_STEP_SUMMARY
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this if and fail right in above case default branch when RUNNER_TAG isn't PVC_PERF or BMG_PERF. Is there a case when RUNNER_TAG may be different than them?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants