-
Notifications
You must be signed in to change notification settings - Fork 792
[CI][Bench] Double-check that the device being used is the device we want #20383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: sycl
Are you sure you want to change the base?
Conversation
Another option could be to use |
echo "Searching for a $MACHINE_TYPE device..." | ||
explicit_device_sel="" | ||
sycl_ls_out="$(mktemp)" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am against adding this code in this way.
It is complex, not automatically tested and once we commit it we will be afraid of touching this code worrying of not breaking it.
Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.
If we really need a code that automatically creates ONEAPI_DEVICE_SELECTOR out of MACHINE_TYPE and sycl-ls output then please let's create a testable code.
Below proposition in python. You may do similar in bash if you know how to nicely test it. I don't know, thus proposition in python.
- create a script get_oneapi_device_selector.py taking as argument MACHINE_TYPE
- inside create two functions:
- function get_oneapi_device_selector( sycl_ls_outpout : string, machine_type : string) returning string, and...
- function main executing sycl-ls and passing it to get_oneapi_device_selector
- write python doctests for get_oneapi_device_selector function. It will be a nice test as it takes strings and outputs string. They will also well document what this selector we expect to be having given input. Two cases will be enough - one with actual BMG sycl-ls output and one with PVC output. Of course we may add more (one card, two cards, no cards). Main function may be left untested - it just calls get_oneapi_device_selector function.
- make doctests that you created beging executed in our CI
This is (a bit) heavy, but there is too much technical debt in benchmarking scripts related to skipping unittests and I'd like to propose we do not produce this debt more. Once done this way we will have doctests working, which would be nice.
Again, I think just hardcoding selectors is good too (or even better). If you think a script is really a better solution, but we don't have time to write it in testable way now, then maybe hardcode selectors for now and leave TODO with pointer to this discussion and to your current commit, which we plan to turn into tested script in the future.
update: The easiest, (ugly, but acceptable because tested by doctests) way of writing get_oneapi_device_selector.py would be calling your bash code within get_oneapi_device_selector function. If in hurry, we may consider this way and refactor in the future. Refactor will be a pleasure, as we will already have tests :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Its goal is to produce right ONEAPI_DEVICE_SELECTOR for our two machines where we run tests on two GPUs.
The simplest solution will be just hardcode that selector here depending on MACHINE_TYPE or machine name. Simple, readable.
I avoided hardcoding because:
- In case numbering/output of
sycl-ls
changes (i.e., there are discussions right now for changes potentially changing numbering insycl-ls
), then hardcoding values could result in the wrong GPU being used - If a single device exists per backend,
sycl-ls
defaults to no device identifiers; one would just use e.g.level_zero:gpu
. In the case that a runner is changed, such a value could also potentially result in the wrong GPU being used.
But, your concerns w.r.t. a lack of testing is entirely valid. A lack of testing outside of a precommit test-run of our code is not exactly reassuring. Given the slightly problematic nature of sycl-ls
mentioned above and your objections, perhaps hardcoding to use SYCL_DEVICE_ALLOWLIST
as Udit mentioned is a better and simpler solution instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah... the only way to ensure what device you are using with SYCL_DEVICE_ALLOWLIST
is to rely on DeviceName
, which is deprecated... If we were to go for the env vars path, I'd need to add this it ONEAPI_DEVICE_SELECTOR
(and consult stakeholders, get approvals, etc.; unfortunately my internship is ending, and I won't have the time for this.)
The other solution is to turn this code into python instead; I will make the utilities and the doctests, but running these doctests in precommit will have to be a separate PR.
'["PVC_PERF"]') MACHINE_TYPE="PVC" ;; | ||
'["BMG_PERF"]') MACHINE_TYPE="BMG" ;; | ||
# Best effort at matching | ||
*) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By chance, since we touch near this code.
Could you please remove "best effort" and fail in unknown case? This is dead code, non PVC_PERF/BMG_PERF conditions should never happen and nobody knows what will happen if it will pass through.
# TODO: in terms of security, is this overkill? | ||
if [ -z "$(printf '%s' "$RUNNER_NAME" | grep -oE '^[a-zA-Z0-9_-]+$')" ]; then | ||
echo "Bad runner name, please ensure runner name is [a-zA-Z0-9_-]." | ||
echo "**Error:** Bad runner name '$RUNNER_NAME', please ensure runner name is [a-zA-Z0-9_-]." >> $GITHUB_STEP_SUMMARY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove this if and fail right in above case default branch when RUNNER_TAG isn't PVC_PERF or BMG_PERF. Is there a case when RUNNER_TAG may be different than them?
There are concerns that there may be multiple GPUs available on a performance runner, i.e. incase an integrated GPU accidentally gets picked. This PR introduces a script to parse for the correct
ONEAPI_DEVICE_SELECTOR
value to use.Test runs: