

@theHamsta commented Sep 30, 2025

Description

This change adds the vendor ID and the PCIe bus ID (bus_id) to the properties detected during Linux device discovery.

These properties are used to enable device discovery on Linux for the TRT-RTX execution provider.
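To make the two new properties concrete: on Linux, /sys/class/drm/cardN/device is a symlink to the card's PCI device directory, the last component of that link is the PCI address, and the directory contains a `vendor` file holding the vendor ID as hex text (e.g. `0x10de` for NVIDIA). The following is a minimal standalone sketch assuming C++17 `<filesystem>`; `PciProperties` and `ReadPciProperties` are hypothetical names, not the code in this PR.

```cpp
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical container for the properties read from one DRM card node.
struct PciProperties {
  std::uint32_t vendor_id = 0;  // e.g. 0x10de for NVIDIA
  std::string bus_id;           // e.g. "0000:65:00.0"
};

// Reads the PCI vendor ID and bus ID for a DRM card directory such as
// /sys/class/drm/card1. Returns std::nullopt if either cannot be read.
std::optional<PciProperties> ReadPciProperties(const fs::path& card_path) {
  std::error_code ec;
  // <card>/device is a symlink to the PCI device directory; its last path
  // component is the PCI address (domain:bus:device.function).
  const fs::path device_link = fs::read_symlink(card_path / "device", ec);
  if (ec) return std::nullopt;

  // <card>/device/vendor holds the vendor ID as hex text, e.g. "0x10de".
  std::ifstream vendor_file(card_path / "device" / "vendor");
  std::string vendor_text;
  if (!(vendor_file >> vendor_text)) return std::nullopt;

  PciProperties props;
  try {
    // Base 16 accepts an optional "0x" prefix.
    props.vendor_id =
        static_cast<std::uint32_t>(std::stoul(vendor_text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
  props.bus_id = device_link.filename().string();
  return props;
}
```

Taking the card directory as a parameter (instead of hard-coding /sys) keeps the helper testable against a fake sysfs tree.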

Motivation and Context

I want to use device discovery for the TRT-RTX EP on Linux as well.

These changes have already been tested with the newly added inference samples in microsoft/onnxruntime-inference-examples#529. Further testing is required to see whether this repo's CI passes, so I'm marking this as a draft for now.

@gedoensmax for visibility

@theHamsta marked this pull request as draft September 30, 2025 16:28

// metadata
gpu_device.metadata.Add("card_idx", MakeString(path_info.card_idx));
gpu_device.metadata.Add("bus_id", std::filesystem::read_symlink(sysfs_path / "device").filename().string()); // e.g. 0000:65:00.0
Contributor


Is it possible to infer from the bus_id whether the GPU is discrete? I'm trying to figure out how to determine that; any pointers would be appreciated.

Author

@theHamsta Oct 1, 2025


Hi! I don't know whether this is reliable, but an iGPU's PCI device seems to sit directly under a root bus whose ID has the pattern pciXXXX:00. dGPUs seem to have names like card1 or higher, while an iGPU would be named card0. So it's the topology of the bus, rather than the bus ID of the GPU itself, that distinguishes them.

dGPU

realpath /sys/class/drm/card1/device 
/sys/devices/pci0000:64/0000:64:00.0/0000:65:00.0

iGPUs would often have a path of the following form and be named card0:

/sys/devices/pciXXXX:00/XXXX:00:00.0/

I don't have an iGPU, so my node

00:00.0 Host bridge: Intel Corporation Sky Lake-E DMI3 Registers (rev 07)

or

/sys/devices/pci0000:00/

doesn't have any DRM nodes.
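The topology heuristic described in this comment can be sketched as follows: resolve the card's `device` link (the equivalent of `realpath /sys/class/drm/cardN/device`) and check whether its parent directory is a root bus such as `pci0000:00`. This only encodes the observation from this thread and is not known to be reliable; `GuessGpuTopology` and `GpuTopologyGuess` are hypothetical names, and C++17 `<filesystem>` is assumed.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

enum class GpuTopologyGuess { kLikelyIntegrated, kLikelyDiscrete, kUnknown };

// Heuristic only: an iGPU's PCI device directory tends to sit directly under
// a root bus directory named like "pci0000:00", while a dGPU sits behind a
// bridge or root port whose directory is itself a PCI address such as
// "0000:64:00.0".
GpuTopologyGuess GuessGpuTopology(const fs::path& card_path) {
  std::error_code ec;
  // Resolves all symlinks, like `realpath <card>/device`.
  const fs::path device_dir = fs::canonical(card_path / "device", ec);
  if (ec) return GpuTopologyGuess::kUnknown;

  const std::string parent_name =
      device_dir.parent_path().filename().string();
  // Parent is a root bus ("pciDDDD:BB"): device hangs directly off the root.
  if (parent_name.rfind("pci", 0) == 0) {
    return GpuTopologyGuess::kLikelyIntegrated;
  }
  return GpuTopologyGuess::kLikelyDiscrete;
}
```

With the dGPU path above, the parent of 0000:65:00.0 is 0000:64:00.0, so the sketch reports a likely discrete GPU.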

Contributor


Good to know, thanks for sharing that observation.

@theHamsta force-pushed the sseitz/linux-device-discovery branch 3 times, most recently from 8e33417 to bc7f4bb on October 1, 2025 08:14
// Tests autoEP feature to automatically select an EP that supports the GPU.
// Currently only works on Windows.
TEST(NvExecutionProviderTest, AutoEp_PreferGpu) {
PathString model_name = ORT_TSTR("nv_execution_provider_auto_ep.onnx");
Author


This test is currently still failing: NvTensorRtRtxEpFactory::CreateEp is called but not implemented (https://github.com/theHamsta/onnxruntime/blob/bc7f4bbc92ed0609e18e1c4c80c3c8d560ce1729/onnxruntime/core/providers/nv_tensorrt_rtx/nv_provider_factory.cc#L698). The other test passes.

Author


Where would the internal factories on Windows come from (https://github.com/theHamsta/onnxruntime/blob/20d2c8912708b37fb914fbb0c719914a4040830c/onnxruntime/core/session/provider_policy_context.cc#L351-L358)? I don't have them on Linux when no execution providers are appended to the session options. My desired device gets selected, but NvEp::CreateEp is called.

NvEp, like other EPs, does not implement CreateEp. I guess IExecutionProvider.CreateProviders is the modern version of it?

@theHamsta force-pushed the sseitz/linux-device-discovery branch 3 times, most recently from 8820257 to 6c77185 on October 13, 2025 14:19
@theHamsta force-pushed the sseitz/linux-device-discovery branch from 824dc92 to 5f49245 on October 14, 2025 15:31
@theHamsta force-pushed the sseitz/linux-device-discovery branch from 5f49245 to 0bf0c6b on October 14, 2025 15:39
#endif
}
}

Author


@edgchen1 This resolves the failing test under Linux. Alternatively, I could put std::filesystem::read_symlink(std::filesystem::path("/proc/self/exe")).parent_path() in the unit test and move PosixEnv::GetRuntimePath to a separate PR.

I'm not sure whether the Windows behavior of setting the runtime path to the current executable's directory is always desired (but it would be needed for the tests to behave the same as on Windows).
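A minimal sketch of the test-only alternative mentioned above: resolving the `/proc/self/exe` symlink to get the directory of the running binary. `GetExecutableDir` is a hypothetical helper name, and the approach is Linux-specific because it relies on procfs.

```cpp
#include <filesystem>
#include <system_error>

// Returns the directory of the running executable by resolving the
// /proc/self/exe symlink. Returns an empty path if /proc is unavailable.
std::filesystem::path GetExecutableDir() {
  std::error_code ec;
  const std::filesystem::path exe =
      std::filesystem::read_symlink("/proc/self/exe", ec);
  if (ec) return {};
  return exe.parent_path();
}
```

Using the `std::error_code` overload avoids an exception when procfs is not mounted, e.g. in some minimal containers.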

@theHamsta marked this pull request as ready for review October 14, 2025 15:41
PathString GetRuntimePath() const override {
Dl_info dl_info{};
// Must be one of the symbols exported in libonnxruntime.{so,dynlib}.
void* symbol_from_this_library = dlsym(RTLD_DEFAULT, "OrtGetApiBase");
Author


This needs to be an exported symbol. onnxruntime_core.a does not have any exported symbols by itself (we could add one, though); only onnxruntime.so does.

We try to follow the implementation of WindowsEnv which resolves the
runtime directory as the directory of the current DLL (onnxruntime) or
executable for static linkage.
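For comparison, here is a minimal sketch of a GetRuntimePath-style lookup that does not depend on an exported symbol. `dladdr()` maps any code address to the shared object (or executable) that contains it, so the address of a function defined in the same library is enough. This swaps in a different technique than the `dlsym(RTLD_DEFAULT, "OrtGetApiBase")` lookup in the PR. `GetRuntimeDir` and `RuntimePathAnchor` are hypothetical names, the `/proc/self/exe` fallback assumes Linux, and glibc older than 2.34 needs `-ldl` at link time.

```cpp
#include <dlfcn.h>

#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

namespace {
// Any function defined in this module works as an address anchor; dladdr()
// resolves the object containing an address, so no export is required.
void RuntimePathAnchor() {}
}  // namespace

// Returns the directory of the shared library containing this code, or of
// the executable when the code is linked statically.
fs::path GetRuntimeDir() {
  Dl_info info{};
  if (dladdr(reinterpret_cast<void*>(&RuntimePathAnchor), &info) != 0 &&
      info.dli_fname != nullptr) {
    const fs::path object_path(info.dli_fname);
    // For the main executable, glibc may report argv[0], which can be
    // relative; only trust absolute paths here.
    if (object_path.is_absolute()) return object_path.parent_path();
  }
  // Fallback: directory of the running executable.
  std::error_code ec;
  const fs::path exe = fs::read_symlink("/proc/self/exe", ec);
  return ec ? fs::path{} : exe.parent_path();
}
```

When this code lives in libonnxruntime.so, `dli_fname` would point at the library, which mirrors the WindowsEnv behavior of using the DLL's directory.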
@theHamsta force-pushed the sseitz/linux-device-discovery branch from 0bf0c6b to b4a38fc on October 14, 2025 15:52
@theHamsta requested a review from edgchen1 October 16, 2025 10:16