
Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Oct 13, 2025

Summary:
The Offloading library wraps around the underlying plugins. The problem
is that we currently initialize every plugin we find, even if the
program never needs it. This is very expensive for trivial uses, since
fully heterogeneous usage is quite rare; in practice it means you always
pay a 200 ms penalty just for having CUDA installed.

This patch changes the behavior to provide accessors into the plugins
and devices that allow them to be initialized lazily. We use a
std::once_flag, which gives a cheap fast-path check once initialization
has completed while still blocking concurrent callers during
initialization.

Making full use of this will require a way to filter platforms more
specifically, and I'm still thinking about what that would look like as
an API: either an extra iterate function that takes a callback on the
platform, or a helper that finds all the devices able to run a given
image. Maybe both?
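
As a purely hypothetical sketch of those two options (none of these declarations exist in liboffload today; the return type and callback shapes are assumed for illustration only):

// Option 1: iterate platforms and let the caller decide which ones to touch,
// so only the plugins the callback actually uses ever get initialized.
typedef bool (*ol_platform_iterate_cb_t)(ol_platform_handle_t Platform,
                                         void *UserData);
ol_result_t olIteratePlatforms(ol_platform_iterate_cb_t Callback,
                               void *UserData);

// Option 2: a helper that walks only the devices able to run a given image,
// skipping (and never initializing) platforms whose plugin can't handle it.
ol_result_t olIterateDevicesForImage(const void *ImageStart, size_t ImageSize,
                                     ol_device_iterate_cb_t Callback,
                                     void *UserData);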

Fixes: #159636

@llvmbot
Member

llvmbot commented Oct 13, 2025

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/163272.diff

2 Files Affected:

  • (modified) offload/liboffload/API/Common.td (+2-1)
  • (modified) offload/liboffload/src/OffloadImpl.cpp (+86-38)
diff --git a/offload/liboffload/API/Common.td b/offload/liboffload/API/Common.td
index ac27d85b6c964..b47223612479a 100644
--- a/offload/liboffload/API/Common.td
+++ b/offload/liboffload/API/Common.td
@@ -140,9 +140,10 @@ def ol_dimensions_t : Struct {
 }
 
 def olInit : Function {
-  let desc = "Perform initialization of the Offload library and plugins";
+  let desc = "Perform initialization of the Offload library";
   let details = [
     "This must be the first API call made by a user of the Offload library",
+    "The underlying platforms are lazily initialized on their first use"
     "Each call will increment an internal reference count that is decremented by `olShutDown`"
   ];
   let params = [];
diff --git a/offload/liboffload/src/OffloadImpl.cpp b/offload/liboffload/src/OffloadImpl.cpp
index c549ae04361d0..481965249b2eb 100644
--- a/offload/liboffload/src/OffloadImpl.cpp
+++ b/offload/liboffload/src/OffloadImpl.cpp
@@ -42,17 +42,40 @@ using namespace error;
 struct ol_platform_impl_t {
   ol_platform_impl_t(std::unique_ptr<GenericPluginTy> Plugin,
                      ol_platform_backend_t BackendType)
-      : Plugin(std::move(Plugin)), BackendType(BackendType) {}
-  std::unique_ptr<GenericPluginTy> Plugin;
-  llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> Devices;
+      : BackendType(BackendType), Plugin(std::move(Plugin)) {}
   ol_platform_backend_t BackendType;
 
+  /// Get the plugin, lazily initializing it if necessary.
+  llvm::Expected<GenericPluginTy *> getPlugin() {
+    if (llvm::Error Err = init())
+      return Err;
+    return Plugin.get();
+  }
+
+  /// Get the device list, lazily initializing it if necessary.
+  llvm::Expected<llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> &>
+  getDevices() {
+    if (llvm::Error Err = init())
+      return Err;
+    return Devices;
+  }
+
   /// Complete all pending work for this platform and perform any needed
   /// cleanup.
   ///
   /// After calling this function, no liboffload functions should be called with
   /// this platform handle.
   llvm::Error destroy();
+
+  /// Initialize the associated plugin and devices.
+  llvm::Error init();
+
+  /// Direct access to the plugin, may be uninitialized if accessed here.
+  std::unique_ptr<GenericPluginTy> Plugin;
+
+private:
+  std::once_flag Initialized;
+  llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> Devices;
 };
 
 // Handle type definitions. Ideally these would be 1:1 with the plugins, but
@@ -130,6 +153,39 @@ llvm::Error ol_platform_impl_t::destroy() {
   return Result;
 }
 
+llvm::Error ol_platform_impl_t::init() {
+  std::unique_ptr<llvm::Error> Storage;
+
+  // This can be called concurrently, make sure we only do the actual
+  // initialization once.
+  std::call_once(Initialized, [&]() {
+    // FIXME: Need better handling for the host platform.
+    if (!Plugin)
+      return;
+
+    llvm::Error Err = Plugin->init();
+    if (Err) {
+      Storage = std::make_unique<llvm::Error>(std::move(Err));
+      return;
+    }
+
+    for (auto DevNum = 0; DevNum < Plugin->number_of_devices(); DevNum++) {
+      if (Plugin->init_device(DevNum) == OFFLOAD_SUCCESS) {
+        auto Device = &Plugin->getDevice(DevNum);
+        auto Info = Device->obtainInfoImpl();
+        if (llvm::Error Err = Info.takeError()) {
+          Storage = std::make_unique<llvm::Error>(std::move(Err));
+          return;
+        }
+        Devices.emplace_back(std::make_unique<ol_device_impl_t>(
+            DevNum, Device, *this, std::move(*Info)));
+      }
+    }
+  });
+
+  return Storage ? std::move(*Storage) : llvm::Error::success();
+}
+
 struct ol_queue_impl_t {
   ol_queue_impl_t(__tgt_async_info *AsyncInfo, ol_device_handle_t Device)
       : AsyncInfo(AsyncInfo), Device(Device), Id(IdCounter++) {}
@@ -209,13 +265,9 @@ struct OffloadContext {
   // key in AllocInfoMap
   llvm::SmallVector<void *> AllocBases{};
   SmallVector<std::unique_ptr<ol_platform_impl_t>, 4> Platforms{};
+  ol_device_handle_t HostDevice;
   size_t RefCount;
 
-  ol_device_handle_t HostDevice() {
-    // The host platform is always inserted last
-    return Platforms.back()->Devices[0].get();
-  }
-
   static OffloadContext &get() {
     assert(OffloadContextVal);
     return *OffloadContextVal;
@@ -259,28 +311,16 @@ Error initPlugins(OffloadContext &Context) {
   } while (false);
 #include "Shared/Targets.def"
 
-  // Preemptively initialize all devices in the plugin
-  for (auto &Platform : Context.Platforms) {
-    auto Err = Platform->Plugin->init();
-    [[maybe_unused]] std::string InfoMsg = toString(std::move(Err));
-    for (auto DevNum = 0; DevNum < Platform->Plugin->number_of_devices();
-         DevNum++) {
-      if (Platform->Plugin->init_device(DevNum) == OFFLOAD_SUCCESS) {
-        auto Device = &Platform->Plugin->getDevice(DevNum);
-        auto Info = Device->obtainInfoImpl();
-        if (auto Err = Info.takeError())
-          return Err;
-        Platform->Devices.emplace_back(std::make_unique<ol_device_impl_t>(
-            DevNum, Device, *Platform, std::move(*Info)));
-      }
-    }
-  }
-
   // Add the special host device
   auto &HostPlatform = Context.Platforms.emplace_back(
       std::make_unique<ol_platform_impl_t>(nullptr, OL_PLATFORM_BACKEND_HOST));
-  HostPlatform->Devices.emplace_back(std::make_unique<ol_device_impl_t>(
-      -1, nullptr, *HostPlatform, InfoTreeNode{}));
+  auto DevicesOrErr = HostPlatform->getDevices();
+  if (!DevicesOrErr)
+    return DevicesOrErr.takeError();
+  Context.HostDevice = DevicesOrErr
+                           ->emplace_back(std::make_unique<ol_device_impl_t>(
+                               -1, nullptr, *HostPlatform, InfoTreeNode{}))
+                           .get();
 
   Context.TracingEnabled = std::getenv("OFFLOAD_TRACE");
   Context.ValidationEnabled = !std::getenv("OFFLOAD_DISABLE_VALIDATION");
@@ -315,12 +355,12 @@ Error olShutDown_impl() {
   llvm::Error Result = Error::success();
   auto *OldContext = OffloadContextVal.exchange(nullptr);
 
-  for (auto &P : OldContext->Platforms) {
+  for (auto &Platform : OldContext->Platforms) {
     // Host plugin is nullptr and has no deinit
-    if (!P->Plugin || !P->Plugin->is_initialized())
+    if (!Platform->Plugin || !Platform->Plugin->is_initialized())
       continue;
 
-    if (auto Res = P->destroy())
+    if (auto Res = Platform->destroy())
       Result = llvm::joinErrors(std::move(Result), std::move(Res));
   }
 
@@ -334,9 +374,14 @@ Error olGetPlatformInfoImplDetail(ol_platform_handle_t Platform,
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
   bool IsHost = Platform->BackendType == OL_PLATFORM_BACKEND_HOST;
 
+  auto PluginOrErr = Platform->getPlugin();
+  if (!PluginOrErr)
+    return PluginOrErr.takeError();
+  GenericPluginTy *Plugin = *PluginOrErr;
+
   switch (PropName) {
   case OL_PLATFORM_INFO_NAME:
-    return Info.writeString(IsHost ? "Host" : Platform->Plugin->getName());
+    return Info.writeString(IsHost ? "Host" : Plugin->getName());
   case OL_PLATFORM_INFO_VENDOR_NAME:
     // TODO: Implement this
     return Info.writeString("Unknown platform vendor");
@@ -373,7 +418,7 @@ Error olGetPlatformInfoSize_impl(ol_platform_handle_t Platform,
 Error olGetDeviceInfoImplDetail(ol_device_handle_t Device,
                                 ol_device_info_t PropName, size_t PropSize,
                                 void *PropValue, size_t *PropSizeRet) {
-  assert(Device != OffloadContext::get().HostDevice());
+  assert(Device != OffloadContext::get().HostDevice);
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
 
   auto makeError = [&](ErrorCode Code, StringRef Err) {
@@ -511,7 +556,7 @@ Error olGetDeviceInfoImplDetail(ol_device_handle_t Device,
 Error olGetDeviceInfoImplDetailHost(ol_device_handle_t Device,
                                     ol_device_info_t PropName, size_t PropSize,
                                     void *PropValue, size_t *PropSizeRet) {
-  assert(Device == OffloadContext::get().HostDevice());
+  assert(Device == OffloadContext::get().HostDevice);
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
 
   constexpr auto uint32_max = std::numeric_limits<uint32_t>::max();
@@ -579,7 +624,7 @@ Error olGetDeviceInfoImplDetailHost(ol_device_handle_t Device,
 
 Error olGetDeviceInfo_impl(ol_device_handle_t Device, ol_device_info_t PropName,
                            size_t PropSize, void *PropValue) {
-  if (Device == OffloadContext::get().HostDevice())
+  if (Device == OffloadContext::get().HostDevice)
     return olGetDeviceInfoImplDetailHost(Device, PropName, PropSize, PropValue,
                                          nullptr);
   return olGetDeviceInfoImplDetail(Device, PropName, PropSize, PropValue,
@@ -588,7 +633,7 @@ Error olGetDeviceInfo_impl(ol_device_handle_t Device, ol_device_info_t PropName,
 
 Error olGetDeviceInfoSize_impl(ol_device_handle_t Device,
                                ol_device_info_t PropName, size_t *PropSizeRet) {
-  if (Device == OffloadContext::get().HostDevice())
+  if (Device == OffloadContext::get().HostDevice)
     return olGetDeviceInfoImplDetailHost(Device, PropName, 0, nullptr,
                                          PropSizeRet);
   return olGetDeviceInfoImplDetail(Device, PropName, 0, nullptr, PropSizeRet);
@@ -596,9 +641,12 @@ Error olGetDeviceInfoSize_impl(ol_device_handle_t Device,
 
 Error olIterateDevices_impl(ol_device_iterate_cb_t Callback, void *UserData) {
   for (auto &Platform : OffloadContext::get().Platforms) {
-    for (auto &Device : Platform->Devices) {
+    auto DevicesOrErr = Platform->getDevices();
+    if (!DevicesOrErr)
+      return DevicesOrErr.takeError();
+    for (auto &Device : *DevicesOrErr) {
       if (!Callback(Device.get(), UserData)) {
-        break;
+        return Error::success();
       }
     }
   }
@@ -949,7 +997,7 @@ Error olCreateEvent_impl(ol_queue_handle_t Queue, ol_event_handle_t *EventOut) {
 Error olMemcpy_impl(ol_queue_handle_t Queue, void *DstPtr,
                     ol_device_handle_t DstDevice, const void *SrcPtr,
                     ol_device_handle_t SrcDevice, size_t Size) {
-  auto Host = OffloadContext::get().HostDevice();
+  auto Host = OffloadContext::get().HostDevice;
   if (DstDevice == Host && SrcDevice == Host) {
     if (!Queue) {
       std::memcpy(DstPtr, SrcPtr, Size);

Inline review comment on offload/liboffload/src/OffloadImpl.cpp, at this snippet:

llvm::Error ol_platform_impl_t::init() {
  std::unique_ptr<llvm::Error> Storage;
Contributor


Is it needed to use dynamic memory?

Contributor Author


As far as I'm aware, if you used a plain local llvm::Error here you'd get assertions about that local never being handled, so you need something nullable. It will only dynamically allocate in the error case, so I don't think performance is an issue there.
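
For reference, a minimal standalone sketch of the pattern in question (the helper names here are illustrative, not taken from the patch): the error is heap-allocated only on the failure path, so there is never an unchecked llvm::Error lying around on the fast path.

#include "llvm/Support/Error.h"
#include <memory>
#include <mutex>

// Hypothetical stand-in for the expensive, possibly-failing plugin init.
static llvm::Error expensiveInit(bool Fail) {
  if (Fail)
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "plugin init failed");
  return llvm::Error::success();
}

static llvm::Error initOnce(std::once_flag &Flag, bool Fail) {
  // Stays null unless initialization actually fails; a plain local
  // llvm::Error that might never be inspected would trip the unchecked-error
  // assertions in builds with LLVM_ENABLE_ABI_BREAKING_CHECKS.
  std::unique_ptr<llvm::Error> Storage;
  std::call_once(Flag, [&]() {
    if (llvm::Error Err = expensiveInit(Fail))
      Storage = std::make_unique<llvm::Error>(std::move(Err));
  });
  return Storage ? std::move(*Storage) : llvm::Error::success();
}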

@jplehr jplehr requested a review from Kewen12 October 14, 2025 07:56
Contributor

@adurang adurang left a comment


lgtm

@jhuber6 jhuber6 merged commit 4a35c4d into llvm:main Oct 14, 2025
10 checks passed
@nico
Contributor

nico commented Oct 14, 2025

Any chance that this is causing http://45.33.8.238/macm1/115570/step_10.txt ?

Am I holding something wrong?

(Linux/x64 seems to be happy.)

@jhuber6
Contributor Author

jhuber6 commented Oct 14, 2025

Any chance that this is causing http://45.33.8.238/macm1/115570/step_10.txt ?

Am I holding something wrong?

(Linux/x64 seems to be happy.)

Completely unrelated

@jhuber6
Contributor Author

jhuber6 commented Oct 14, 2025

Unfortunately I might need to revert this. There are some extremely weird interactions when you tear down the runtime library in a global destructor like we do in the tests. Essentially, if the platform runtime only actually gets initialized after main() has started, but is then destroyed in a global destructor, some global resources may already have been freed by the time that destructor runs.

This happens because calling olInit in a global constructor is now lazy, so the real initialization is triggered later, in main. Because this interacts with the underlying platform (and I know CUDA does a lot of weird teardown magic we have no control over), I can't imagine an easy fix. The only thing I can think of is hacking a global constructor into the runtime to detect whether we're firing before main, but that would be quite awful.

Unfortunately this puts me in an awkward place, because I was hoping for this to be a minimally invasive way to avoid unnecessary overhead... Honestly it might just come down to providing a separate olInit variant that takes a list of desired platforms.
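
To make the ordering concrete, a rough illustration of the failure mode (the liboffload calls, header name, and exact signatures are assumed for illustration; error handling is omitted):

#include <OffloadAPI.h> // header name and signatures assumed

// Counts devices; the first call into olIterateDevices is what lazily
// initializes the CUDA/HSA plugins under this patch.
static bool countDevice(ol_device_handle_t, void *Count) {
  ++*static_cast<int *>(Count);
  return true; // keep iterating
}

// Static destructors and atexit handlers run in reverse order of
// registration, so teardown that the vendor runtime registers during lazy
// initialization in main() runs *before* the destructor of a guard object
// constructed earlier, at static-initialization time.
struct OffloadGuard {
  OffloadGuard() { olInit(); }      // before main(): now cheap, nothing initialized
  ~OffloadGuard() { olShutDown(); } // at process exit: runs after the vendor teardown
};
static OffloadGuard Guard;

int main() {
  int Count = 0;
  // First real use: plugins (e.g. CUDA) initialize here and register their
  // own process-exit teardown, which will therefore run before ~OffloadGuard,
  // leaving olShutDown() to touch an already-torn-down platform.
  olIterateDevices(countDevice, &Count);
  return 0;
}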

jhuber6 added a commit that referenced this pull request Oct 14, 2025
…163272)

Summary:
This causes issues with CUDA's teardown order when the init is separated
from the total init scope.
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
…lvm#163272)

Summary:
This causes issues with CUDA's teardown order when the init is separated
from the total init scope.


Development

Successfully merging this pull request may close these issues.

[Offload] Offloading API always initializes all plugins and devices
