Skip to content

Conversation

@bratpiorka
Copy link
Contributor

Changes:

  • change the log level to "error" for all errors related to adapter dependencies during UR adapter loading
  • enable the printing of UR adapter errors in sycl-ls from platform::get_platforms() using environment variables

With this PR, there will be no changes for the user in the default scenario where some adapters exist and some are missing (a missing adapter is not considered an error). In cases where a dependency is missing, the following example message will be displayed:

<LOADER>[ERROR]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: libumf.so.0: cannot open shared object file: No such file or directory

Here is the message for missing symbols (a case where a dependency is not compatible with the rest of the libraries):

<LOADER>[ERROR]: failed to load adapter 'libur_adapter_level_zero.so.0' with error: /home/rrudnick/llvm/build/lib/libumf.so.0: version `UMF_0.10' not found (required by /home/rrudnick/llvm/build/lib/libur_adapter_level_zero.so.0)

@bratpiorka bratpiorka requested a review from pbalcer March 17, 2025 16:47
@bratpiorka bratpiorka force-pushed the review/rrudnick/sycl_ls_err branch from 89a3054 to 457a1ce Compare March 18, 2025 14:26
@bratpiorka bratpiorka force-pushed the review/rrudnick/sycl_ls_err branch from 457a1ce to 3343425 Compare March 18, 2025 15:12
@bratpiorka bratpiorka marked this pull request as ready for review March 19, 2025 09:52
@bratpiorka bratpiorka requested review from a team as code owners March 19, 2025 09:52
@bratpiorka bratpiorka force-pushed the review/rrudnick/sycl_ls_err branch from 3343425 to c118a99 Compare March 19, 2025 15:40
@aelovikov-intel
Copy link
Contributor

log sycl-ls error adapter loading messages

First, needs [SYCL][UR] tags prefix.
Second, should start with upper-case ("Log ...")
Third, doesn't seem to be a proper sentence, can you re-phrase please?

@bratpiorka
Copy link
Contributor Author

First, needs [SYCL][UR] tags prefix.
Second, should start with upper-case ("Log ...")
Third, doesn't seem to be a proper sentence, can you re-phrase please?

thank you for noticing this, done

@aelovikov-intel
Copy link
Contributor

First, needs [SYCL][UR] tags prefix.
Second, should start with upper-case ("Log ...")
Third, doesn't seem to be a proper sentence, can you re-phrase please?

thank you for noticing this, done

I think all of them apply still...

@bratpiorka bratpiorka changed the title log sycl-ls error adapter loading messages [SYCL][UR] Log SYCL-LS error messages related to adapter loading Mar 20, 2025
@bratpiorka bratpiorka changed the title [SYCL][UR] Log SYCL-LS error messages related to adapter loading [SYCL][UR] Log sycl-ls error messages related to adapter loading Mar 20, 2025
@bratpiorka
Copy link
Contributor Author

@intel/llvm-gatekeepers please merge

@martygrant martygrant merged commit 39b3025 into intel:sycl Mar 21, 2025
31 checks passed
@npmiller
Copy link
Contributor

@bratpiorka I don't think an adapter missing a dependency is necessarily an error.

For example if you download a SYCL nightly build, it will have adapters for CUDA and HIP, but most users won't have both the CUDA and ROCm drivers installed, or even either of them if they're working on Intel devices.

In this case they will get confusing error messages from sycl-ls for platforms that they don't care about.

I think we should either go back to hiding that information behind the environment variable, or only showing it for sycl-ls --verbose and change the message so it's more of an information than an error.

@bratpiorka
Copy link
Contributor Author

I think we should either go back to hiding that information behind the environment variable, or only showing it for sycl-ls --verbose and change the message so it's more of an information than an error.

In theory, using only environment variables should be sufficient, but in reality, our UMF team encountered a few issues that required debugging to determine whether the problem was with the hardware or the software, and where exactly it was. And hardly anyone knew about the environment variables that can show more. My changes make things more user-friendly - if a user expects an adapter to work and it is not working, an error message is displayed.

Would you like me to revert these changes, or do you have an idea on how to get a balance between being user-friendly and over informative? One way to further improve this would be to check which devices are installed on the system and load adapters only for those devices.

@pbalcer
Copy link
Contributor

pbalcer commented Mar 25, 2025

One way to further improve this would be to check which devices are installed on the system and load adapters only for those devices.

That would require either the adapter to be loaded (for platform discovery) or something like libhwloc. I don't know whether that's realistic doable.

I think hiding this behind a --verbose is the easiest path forward for now.

@rafbiels
Copy link
Contributor

if a user expects an adapter to work and it is not working, an error message is displayed

However, if a user doesn't expect an adapter to work, they are still faced with errors. That's a much more common use case.
I'm certain almost no user of open-source releases expects all included adapters to work, since these releases include OpenCL, Level Zero, CUDA and HIP adapters. Most users won't have all the corresponding libraries and devices to use all of them on one machine and will be faced with errors. This refers to nightly binaries today, but soon enough will be also the case for releases like v6.0.0 which are also planned to provide binaries to download.

Example: I just downloaded nightly-2025-03-25 and the first thing sycl-ls prints on my Intel CPU + NVIDIA GPU machine is two ERRORs about AMD.

@npmiller
Copy link
Contributor

In theory, using only environment variables should be sufficient, but in reality, our UMF team encountered a few issues that required debugging to determine whether the problem was with the hardware or the software, and where exactly it was. And hardly anyone knew about the environment variables that can show more.

It could probably be better documented, for what it's worth in the CUDA/HIP release documentation we explain exactly this (see here).

My changes make things more user-friendly - if a user expects an adapter to work and it is not working, an error message is displayed.

It does in specific cases, but I feel like in most cases it's the opposite as it'll spam users with error messages they don't care about. I do agree that it's very confusing when an adapter doesn't load because of a dependency though.

Would you like me to revert these changes, or do you have an idea on how to get a balance between being user-friendly and over informative?

I think printing additional information in sycl-ls --verbose would be an ideal compromise.

I'd suggest making the change if it's quick and you have time, otherwise we should just revert this for now, and maybe file a ticket for the follow up work.

@RossBrunton
Copy link
Contributor

Would it also make sense for errors to be logged if ONEAPI_DEVICE_SELECTOR is set to a specific adapter? Anyone who requests a specific selector should naturally expect that adapter to work, and it feels like setting this environment variable to something would be a first troubleshooting step for anyone trying to debug their installation.

@bratpiorka
Copy link
Contributor Author

@pbalcer @rafbiels @npmiller @RossBrunton please look at #17651

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants