
Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Oct 13, 2025

Summary:
The Offloading library wraps around the underlying plugins. The problem
is that we currently initialize every plugin we find, even if the
program never needs it. This is very expensive for trivial uses, since
fully heterogeneous usage is quite rare; in practice it means you always
pay a 200 ms penalty just for having CUDA installed.

This patch changes the behavior to provide accessors into the plugins
and devices that allow them to be initialized lazily. We use a
std::once_flag, which gives a cheap fast-path check once initialization
has completed while still blocking concurrent callers during
initialization.

Making full use of this will require a way to filter platforms more
specifically, and I'm still thinking about what that would look like as
an API: either an extra iterate function that takes a callback on the
platform, or a helper that finds all the devices able to run a given
image. Maybe both?
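
As a purely hypothetical sketch of those two options (none of these declarations exist in liboffload today; the return type and callback shapes are assumed for illustration only):

// Option 1: iterate platforms and let the caller decide which ones to touch,
// so only the plugins the callback actually uses ever get initialized.
typedef bool (*ol_platform_iterate_cb_t)(ol_platform_handle_t Platform,
                                         void *UserData);
ol_result_t olIteratePlatforms(ol_platform_iterate_cb_t Callback,
                               void *UserData);

// Option 2: a helper that walks only the devices able to run a given image,
// skipping (and never initializing) platforms whose plugin can't handle it.
ol_result_t olIterateDevicesForImage(const void *ImageStart, size_t ImageSize,
                                     ol_device_iterate_cb_t Callback,
                                     void *UserData);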

Fixes: #159636

@llvmbot
Member

llvmbot commented Oct 13, 2025

@llvm/pr-subscribers-offload

Author: Joseph Huber (jhuber6)

Changes



Full diff: https://github.com/llvm/llvm-project/pull/163272.diff

2 Files Affected:

  • (modified) offload/liboffload/API/Common.td (+2-1)
  • (modified) offload/liboffload/src/OffloadImpl.cpp (+86-38)
diff --git a/offload/liboffload/API/Common.td b/offload/liboffload/API/Common.td
index ac27d85b6c964..b47223612479a 100644
--- a/offload/liboffload/API/Common.td
+++ b/offload/liboffload/API/Common.td
@@ -140,9 +140,10 @@ def ol_dimensions_t : Struct {
 }
 
 def olInit : Function {
-  let desc = "Perform initialization of the Offload library and plugins";
+  let desc = "Perform initialization of the Offload library";
   let details = [
     "This must be the first API call made by a user of the Offload library",
+    "The underlying platforms are lazily initialized on their first use"
     "Each call will increment an internal reference count that is decremented by `olShutDown`"
   ];
   let params = [];
diff --git a/offload/liboffload/src/OffloadImpl.cpp b/offload/liboffload/src/OffloadImpl.cpp
index c549ae04361d0..481965249b2eb 100644
--- a/offload/liboffload/src/OffloadImpl.cpp
+++ b/offload/liboffload/src/OffloadImpl.cpp
@@ -42,17 +42,40 @@ using namespace error;
 struct ol_platform_impl_t {
   ol_platform_impl_t(std::unique_ptr<GenericPluginTy> Plugin,
                      ol_platform_backend_t BackendType)
-      : Plugin(std::move(Plugin)), BackendType(BackendType) {}
-  std::unique_ptr<GenericPluginTy> Plugin;
-  llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> Devices;
+      : BackendType(BackendType), Plugin(std::move(Plugin)) {}
   ol_platform_backend_t BackendType;
 
+  /// Get the plugin, lazily initializing it if necessary.
+  llvm::Expected<GenericPluginTy *> getPlugin() {
+    if (llvm::Error Err = init())
+      return Err;
+    return Plugin.get();
+  }
+
+  /// Get the device list, lazily initializing it if necessary.
+  llvm::Expected<llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> &>
+  getDevices() {
+    if (llvm::Error Err = init())
+      return Err;
+    return Devices;
+  }
+
   /// Complete all pending work for this platform and perform any needed
   /// cleanup.
   ///
   /// After calling this function, no liboffload functions should be called with
   /// this platform handle.
   llvm::Error destroy();
+
+  /// Initialize the associated plugin and devices.
+  llvm::Error init();
+
+  /// Direct access to the plugin, may be uninitialized if accessed here.
+  std::unique_ptr<GenericPluginTy> Plugin;
+
+private:
+  std::once_flag Initialized;
+  llvm::SmallVector<std::unique_ptr<ol_device_impl_t>> Devices;
 };
 
 // Handle type definitions. Ideally these would be 1:1 with the plugins, but
@@ -130,6 +153,39 @@ llvm::Error ol_platform_impl_t::destroy() {
   return Result;
 }
 
+llvm::Error ol_platform_impl_t::init() {
+  std::unique_ptr<llvm::Error> Storage;
+
+  // This can be called concurrently, make sure we only do the actual
+  // initialization once.
+  std::call_once(Initialized, [&]() {
+    // FIXME: Need better handling for the host platform.
+    if (!Plugin)
+      return;
+
+    llvm::Error Err = Plugin->init();
+    if (Err) {
+      Storage = std::make_unique<llvm::Error>(std::move(Err));
+      return;
+    }
+
+    for (auto DevNum = 0; DevNum < Plugin->number_of_devices(); DevNum++) {
+      if (Plugin->init_device(DevNum) == OFFLOAD_SUCCESS) {
+        auto Device = &Plugin->getDevice(DevNum);
+        auto Info = Device->obtainInfoImpl();
+        if (llvm::Error Err = Info.takeError()) {
+          Storage = std::make_unique<llvm::Error>(std::move(Err));
+          return;
+        }
+        Devices.emplace_back(std::make_unique<ol_device_impl_t>(
+            DevNum, Device, *this, std::move(*Info)));
+      }
+    }
+  });
+
+  return Storage ? std::move(*Storage) : llvm::Error::success();
+}
+
 struct ol_queue_impl_t {
   ol_queue_impl_t(__tgt_async_info *AsyncInfo, ol_device_handle_t Device)
       : AsyncInfo(AsyncInfo), Device(Device), Id(IdCounter++) {}
@@ -209,13 +265,9 @@ struct OffloadContext {
   // key in AllocInfoMap
   llvm::SmallVector<void *> AllocBases{};
   SmallVector<std::unique_ptr<ol_platform_impl_t>, 4> Platforms{};
+  ol_device_handle_t HostDevice;
   size_t RefCount;
 
-  ol_device_handle_t HostDevice() {
-    // The host platform is always inserted last
-    return Platforms.back()->Devices[0].get();
-  }
-
   static OffloadContext &get() {
     assert(OffloadContextVal);
     return *OffloadContextVal;
@@ -259,28 +311,16 @@ Error initPlugins(OffloadContext &Context) {
   } while (false);
 #include "Shared/Targets.def"
 
-  // Preemptively initialize all devices in the plugin
-  for (auto &Platform : Context.Platforms) {
-    auto Err = Platform->Plugin->init();
-    [[maybe_unused]] std::string InfoMsg = toString(std::move(Err));
-    for (auto DevNum = 0; DevNum < Platform->Plugin->number_of_devices();
-         DevNum++) {
-      if (Platform->Plugin->init_device(DevNum) == OFFLOAD_SUCCESS) {
-        auto Device = &Platform->Plugin->getDevice(DevNum);
-        auto Info = Device->obtainInfoImpl();
-        if (auto Err = Info.takeError())
-          return Err;
-        Platform->Devices.emplace_back(std::make_unique<ol_device_impl_t>(
-            DevNum, Device, *Platform, std::move(*Info)));
-      }
-    }
-  }
-
   // Add the special host device
   auto &HostPlatform = Context.Platforms.emplace_back(
       std::make_unique<ol_platform_impl_t>(nullptr, OL_PLATFORM_BACKEND_HOST));
-  HostPlatform->Devices.emplace_back(std::make_unique<ol_device_impl_t>(
-      -1, nullptr, *HostPlatform, InfoTreeNode{}));
+  auto DevicesOrErr = HostPlatform->getDevices();
+  if (!DevicesOrErr)
+    return DevicesOrErr.takeError();
+  Context.HostDevice = DevicesOrErr
+                           ->emplace_back(std::make_unique<ol_device_impl_t>(
+                               -1, nullptr, *HostPlatform, InfoTreeNode{}))
+                           .get();
 
   Context.TracingEnabled = std::getenv("OFFLOAD_TRACE");
   Context.ValidationEnabled = !std::getenv("OFFLOAD_DISABLE_VALIDATION");
@@ -315,12 +355,12 @@ Error olShutDown_impl() {
   llvm::Error Result = Error::success();
   auto *OldContext = OffloadContextVal.exchange(nullptr);
 
-  for (auto &P : OldContext->Platforms) {
+  for (auto &Platform : OldContext->Platforms) {
     // Host plugin is nullptr and has no deinit
-    if (!P->Plugin || !P->Plugin->is_initialized())
+    if (!Platform->Plugin || !Platform->Plugin->is_initialized())
       continue;
 
-    if (auto Res = P->destroy())
+    if (auto Res = Platform->destroy())
       Result = llvm::joinErrors(std::move(Result), std::move(Res));
   }
 
@@ -334,9 +374,14 @@ Error olGetPlatformInfoImplDetail(ol_platform_handle_t Platform,
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
   bool IsHost = Platform->BackendType == OL_PLATFORM_BACKEND_HOST;
 
+  auto PluginOrErr = Platform->getPlugin();
+  if (!PluginOrErr)
+    return PluginOrErr.takeError();
+  GenericPluginTy *Plugin = *PluginOrErr;
+
   switch (PropName) {
   case OL_PLATFORM_INFO_NAME:
-    return Info.writeString(IsHost ? "Host" : Platform->Plugin->getName());
+    return Info.writeString(IsHost ? "Host" : Plugin->getName());
   case OL_PLATFORM_INFO_VENDOR_NAME:
     // TODO: Implement this
     return Info.writeString("Unknown platform vendor");
@@ -373,7 +418,7 @@ Error olGetPlatformInfoSize_impl(ol_platform_handle_t Platform,
 Error olGetDeviceInfoImplDetail(ol_device_handle_t Device,
                                 ol_device_info_t PropName, size_t PropSize,
                                 void *PropValue, size_t *PropSizeRet) {
-  assert(Device != OffloadContext::get().HostDevice());
+  assert(Device != OffloadContext::get().HostDevice);
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
 
   auto makeError = [&](ErrorCode Code, StringRef Err) {
@@ -511,7 +556,7 @@ Error olGetDeviceInfoImplDetail(ol_device_handle_t Device,
 Error olGetDeviceInfoImplDetailHost(ol_device_handle_t Device,
                                     ol_device_info_t PropName, size_t PropSize,
                                     void *PropValue, size_t *PropSizeRet) {
-  assert(Device == OffloadContext::get().HostDevice());
+  assert(Device == OffloadContext::get().HostDevice);
   InfoWriter Info(PropSize, PropValue, PropSizeRet);
 
   constexpr auto uint32_max = std::numeric_limits<uint32_t>::max();
@@ -579,7 +624,7 @@ Error olGetDeviceInfoImplDetailHost(ol_device_handle_t Device,
 
 Error olGetDeviceInfo_impl(ol_device_handle_t Device, ol_device_info_t PropName,
                            size_t PropSize, void *PropValue) {
-  if (Device == OffloadContext::get().HostDevice())
+  if (Device == OffloadContext::get().HostDevice)
     return olGetDeviceInfoImplDetailHost(Device, PropName, PropSize, PropValue,
                                          nullptr);
   return olGetDeviceInfoImplDetail(Device, PropName, PropSize, PropValue,
@@ -588,7 +633,7 @@ Error olGetDeviceInfo_impl(ol_device_handle_t Device, ol_device_info_t PropName,
 
 Error olGetDeviceInfoSize_impl(ol_device_handle_t Device,
                                ol_device_info_t PropName, size_t *PropSizeRet) {
-  if (Device == OffloadContext::get().HostDevice())
+  if (Device == OffloadContext::get().HostDevice)
     return olGetDeviceInfoImplDetailHost(Device, PropName, 0, nullptr,
                                          PropSizeRet);
   return olGetDeviceInfoImplDetail(Device, PropName, 0, nullptr, PropSizeRet);
@@ -596,9 +641,12 @@ Error olGetDeviceInfoSize_impl(ol_device_handle_t Device,
 
 Error olIterateDevices_impl(ol_device_iterate_cb_t Callback, void *UserData) {
   for (auto &Platform : OffloadContext::get().Platforms) {
-    for (auto &Device : Platform->Devices) {
+    auto DevicesOrErr = Platform->getDevices();
+    if (!DevicesOrErr)
+      return DevicesOrErr.takeError();
+    for (auto &Device : *DevicesOrErr) {
       if (!Callback(Device.get(), UserData)) {
-        break;
+        return Error::success();
       }
     }
   }
@@ -949,7 +997,7 @@ Error olCreateEvent_impl(ol_queue_handle_t Queue, ol_event_handle_t *EventOut) {
 Error olMemcpy_impl(ol_queue_handle_t Queue, void *DstPtr,
                     ol_device_handle_t DstDevice, const void *SrcPtr,
                     ol_device_handle_t SrcDevice, size_t Size) {
-  auto Host = OffloadContext::get().HostDevice();
+  auto Host = OffloadContext::get().HostDevice;
   if (DstDevice == Host && SrcDevice == Host) {
     if (!Queue) {
       std::memcpy(DstPtr, SrcPtr, Size);

Inline review comment on offload/liboffload/src/OffloadImpl.cpp, at this snippet:

llvm::Error ol_platform_impl_t::init() {
  std::unique_ptr<llvm::Error> Storage;
Contributor


Is it needed to use dynamic memory?

Contributor Author


As far as I'm aware, if you used a plain local llvm::Error here you'd get assertions about that local never being handled, so you need something nullable. It will only dynamically allocate in the error case, so I don't think performance is an issue there.
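
For reference, a minimal standalone sketch of the pattern in question (the helper names here are illustrative, not taken from the patch): the error is heap-allocated only on the failure path, so there is never an unchecked llvm::Error lying around on the fast path.

#include "llvm/Support/Error.h"
#include <memory>
#include <mutex>

// Hypothetical stand-in for the expensive, possibly-failing plugin init.
static llvm::Error expensiveInit(bool Fail) {
  if (Fail)
    return llvm::createStringError(llvm::inconvertibleErrorCode(),
                                   "plugin init failed");
  return llvm::Error::success();
}

static llvm::Error initOnce(std::once_flag &Flag, bool Fail) {
  // Stays null unless initialization actually fails; a plain local
  // llvm::Error that might never be inspected would trip the unchecked-error
  // assertions in builds with LLVM_ENABLE_ABI_BREAKING_CHECKS.
  std::unique_ptr<llvm::Error> Storage;
  std::call_once(Flag, [&]() {
    if (llvm::Error Err = expensiveInit(Fail))
      Storage = std::make_unique<llvm::Error>(std::move(Err));
  });
  return Storage ? std::move(*Storage) : llvm::Error::success();
}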

@jplehr jplehr requested a review from Kewen12 October 14, 2025 07:56
Contributor

@adurang adurang left a comment


lgtm

@jhuber6 jhuber6 merged commit 4a35c4d into llvm:main Oct 14, 2025
10 checks passed
@nico
Contributor

nico commented Oct 14, 2025

Any chance that this is causing http://45.33.8.238/macm1/115570/step_10.txt ?

Am I holding something wrong?

(Linux/x64 seems to be happy.)

@jhuber6
Contributor Author

jhuber6 commented Oct 14, 2025

Any chance that this is causing http://45.33.8.238/macm1/115570/step_10.txt ?

Am I holding something wrong?

(Linux/x64 seems to be happy.)

Completely unrelated

@jhuber6
Contributor Author

jhuber6 commented Oct 14, 2025

Unfortunately I might need to revert this. There are some extremely weird interactions when you tear down the runtime library in a global destructor like we do in the tests. Essentially, if the platform runtime only actually gets initialized after main() has started, but is then destroyed in a global destructor, some global resources may already have been freed by the time that destructor runs.

This happens because calling olInit in a global constructor is now lazy, so the real initialization is triggered later, in main. Because this interacts with the underlying platform (and I know CUDA does a lot of weird teardown magic we have no control over), I can't imagine an easy fix. The only thing I can think of is hacking a global constructor into the runtime to detect whether we're firing before main, but that would be quite awful.

Unfortunately this puts me in an awkward place, because I was hoping for this to be a minimally invasive way to avoid unnecessary overhead... Honestly it might just come down to providing a separate olInit variant that takes a list of desired platforms.
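
To make the ordering concrete, a rough illustration of the failure mode (the liboffload calls, header name, and exact signatures are assumed for illustration; error handling is omitted):

#include <OffloadAPI.h> // header name and signatures assumed

// Counts devices; the first call into olIterateDevices is what lazily
// initializes the CUDA/HSA plugins under this patch.
static bool countDevice(ol_device_handle_t, void *Count) {
  ++*static_cast<int *>(Count);
  return true; // keep iterating
}

// Static destructors and atexit handlers run in reverse order of
// registration, so teardown that the vendor runtime registers during lazy
// initialization in main() runs *before* the destructor of a guard object
// constructed earlier, at static-initialization time.
struct OffloadGuard {
  OffloadGuard() { olInit(); }      // before main(): now cheap, nothing initialized
  ~OffloadGuard() { olShutDown(); } // at process exit: runs after the vendor teardown
};
static OffloadGuard Guard;

int main() {
  int Count = 0;
  // First real use: plugins (e.g. CUDA) initialize here and register their
  // own process-exit teardown, which will therefore run before ~OffloadGuard,
  // leaving olShutDown() to touch an already-torn-down platform.
  olIterateDevices(countDevice, &Count);
  return 0;
}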

jhuber6 added a commit that referenced this pull request Oct 14, 2025
…163272)

Summary:
This causes issues with CUDA's teardown order when the init is separated
from the total init scope.
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
akadutta pushed a commit to akadutta/llvm-project that referenced this pull request Oct 14, 2025
…lvm#163272)

Summary:
This causes issues with CUDA's teardown order when the init is separated
from the total init scope.


Development

Successfully merging this pull request may close these issues.

[Offload] Offloading API always initializes all plugins and devices
