Conversation

jhuber6
Contributor

@jhuber6 jhuber6 commented Oct 9, 2025

Summary:
This patch rewrites the `llvm-gpu-loader` utility to use the LLVMOffload
interface. This greatly simplifies the tool while reusing the already
existing offloading support. Another benefit is that the runtime libraries
can now be loaded dynamically, so we can always build this utility without
needing to find non-standard packages.

One issue, mentioned in
#159636, is that this will now
take extra time if you have both CUDA and ROCm installed on the same
machine. This is only slightly annoying, since most people don't have both
at the same time, so I don't consider it a blocker. I will address it
later.

The environment variable usage is slightly unfortunate; I will expose that
better in the future.

Fixes: #132890

@llvmbot
Member

llvmbot commented Oct 9, 2025

@llvm/pr-subscribers-libc
@llvm/pr-subscribers-backend-nvptx

@llvm/pr-subscribers-backend-amdgpu

Author: Joseph Huber (jhuber6)

Changes

Fixes: #132890


Patch is 71.88 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/162739.diff

10 Files Affected:

  • (modified) libc/cmake/modules/LLVMLibCTestRules.cmake (+3-3)
  • (modified) libc/startup/gpu/amdgpu/start.cpp (+6-30)
  • (modified) libc/startup/gpu/nvptx/start.cpp (+3-37)
  • (modified) llvm/tools/CMakeLists.txt (-4)
  • (modified) llvm/tools/llvm-gpu-loader/CMakeLists.txt (-34)
  • (removed) llvm/tools/llvm-gpu-loader/amdhsa.cpp (-594)
  • (modified) llvm/tools/llvm-gpu-loader/llvm-gpu-loader.cpp (+217-83)
  • (modified) llvm/tools/llvm-gpu-loader/llvm-gpu-loader.h (+164-95)
  • (removed) llvm/tools/llvm-gpu-loader/nvptx.cpp (-367)
  • (removed) llvm/tools/llvm-gpu-loader/server.h (-55)
diff --git a/libc/cmake/modules/LLVMLibCTestRules.cmake b/libc/cmake/modules/LLVMLibCTestRules.cmake
index 19da0ad29cd84..dfc5ab904fe9d 100644
--- a/libc/cmake/modules/LLVMLibCTestRules.cmake
+++ b/libc/cmake/modules/LLVMLibCTestRules.cmake
@@ -87,7 +87,6 @@ function(_get_hermetic_test_compile_options output_var)
          -mcode-object-version=${LIBC_GPU_CODE_OBJECT_VERSION})
   elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
     list(APPEND compile_options
-         "SHELL:-mllvm -nvptx-emit-init-fini-kernel=false"
          -Wno-multi-gpu --cuda-path=${LIBC_CUDA_ROOT}
          -nogpulib -march=${LIBC_GPU_TARGET_ARCHITECTURE} -fno-use-cxa-atexit)
   endif()
@@ -637,6 +636,7 @@ function(add_integration_test test_name)
   # makes `add_custom_target` construct the correct command and execute it.
   set(test_cmd
       ${INTEGRATION_TEST_ENV}
+      $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
       $<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}>
       ${CMAKE_CROSSCOMPILING_EMULATOR}
       ${INTEGRATION_TEST_LOADER_ARGS}
@@ -790,8 +790,7 @@ function(add_libc_hermetic test_name)
   if(LIBC_TARGET_ARCHITECTURE_IS_AMDGPU)
     target_link_options(${fq_build_target_name} PRIVATE
       ${LIBC_COMPILE_OPTIONS_DEFAULT} -Wno-multi-gpu
-      -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto
-      "-Wl,-mllvm,-amdgpu-lower-global-ctor-dtor=0" -nostdlib -static
+      -mcpu=${LIBC_GPU_TARGET_ARCHITECTURE} -flto -nostdlib -static
       "-Wl,-mllvm,-amdhsa-code-object-version=${LIBC_GPU_CODE_OBJECT_VERSION}")
   elseif(LIBC_TARGET_ARCHITECTURE_IS_NVPTX)
     target_link_options(${fq_build_target_name} PRIVATE
@@ -859,6 +858,7 @@ function(add_libc_hermetic test_name)
       string(REPLACE " " ";" test_cmd "${test_cmd_parsed}")
     else()
       set(test_cmd ${HERMETIC_TEST_ENV}
+        $<$<BOOL:${LIBC_TARGET_ARCHITECTURE_IS_NVPTX}>:LIBOMPTARGET_STACK_SIZE=3072>
         $<$<BOOL:${LIBC_TARGET_OS_IS_GPU}>:${gpu_loader_exe}> ${CMAKE_CROSSCOMPILING_EMULATOR} ${HERMETIC_TEST_LOADER_ARGS}
         $<TARGET_FILE:${fq_build_target_name}> ${HERMETIC_TEST_ARGS})
     endif()
diff --git a/libc/startup/gpu/amdgpu/start.cpp b/libc/startup/gpu/amdgpu/start.cpp
index 48f095d924931..e876629e6d0e1 100644
--- a/libc/startup/gpu/amdgpu/start.cpp
+++ b/libc/startup/gpu/amdgpu/start.cpp
@@ -13,6 +13,9 @@
 #include "src/stdlib/atexit.h"
 #include "src/stdlib/exit.h"
 
+// TODO: Merge this and the NVPTX start files once the common `device_kernel`
+// attribute correctly implies `amdgpu_kernel`.
+
 extern "C" int main(int argc, char **argv, char **envp);
 extern "C" void __cxa_finalize(void *dso);
 
@@ -21,45 +24,18 @@ namespace LIBC_NAMESPACE_DECL {
 // FIXME: Factor this out into common logic so we don't need to stub it here.
 void teardown_main_tls() {}
 
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
 DataEnvironment app;
 
-extern "C" uintptr_t __init_array_start[];
-extern "C" uintptr_t __init_array_end[];
-extern "C" uintptr_t __fini_array_start[];
-extern "C" uintptr_t __fini_array_end[];
-
-using InitCallback = void(int, char **, char **);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int argc, char **argv, char **env) {
-  size_t init_array_size = __init_array_end - __init_array_start;
-  for (size_t i = 0; i < init_array_size; ++i)
-    reinterpret_cast<InitCallback *>(__init_array_start[i])(argc, argv, env);
-}
-
-static void call_fini_array_callbacks() {
-  size_t fini_array_size = __fini_array_end - __fini_array_start;
-  for (size_t i = fini_array_size; i > 0; --i)
-    reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
 } // namespace LIBC_NAMESPACE_DECL
 
 extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel,
              clang::amdgpu_flat_work_group_size(1, 1),
              clang::amdgpu_max_num_work_groups(1)]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+  // The LLVM offloading runtime will automatically call any present global
+  // constructors and destructors so we defer that handling.
   __atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
                    reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
-  // We want the fini array callbacks to be run after other atexit
-  // callbacks are run. So, we register them before running the init
-  // array callbacks as they can potentially register their own atexit
-  // callbacks.
-  LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
-  LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
 }
 
 extern "C" [[gnu::visibility("protected"), clang::amdgpu_kernel]] void
diff --git a/libc/startup/gpu/nvptx/start.cpp b/libc/startup/gpu/nvptx/start.cpp
index ce8f5bbb6d4eb..822930e22bc26 100644
--- a/libc/startup/gpu/nvptx/start.cpp
+++ b/libc/startup/gpu/nvptx/start.cpp
@@ -23,48 +23,14 @@ DataEnvironment app;
 // FIXME: Factor this out into common logic so we don't need to stub it here.
 void teardown_main_tls() {}
 
-// FIXME: Touch this symbol to force this to be linked in statically.
-volatile void *dummy = &LIBC_NAMESPACE::rpc::client;
-
-extern "C" {
-// Nvidia's 'nvlink' linker does not provide these symbols. We instead need
-// to manually create them and update the globals in the loader implememtation.
-uintptr_t *__init_array_start [[gnu::visibility("protected")]];
-uintptr_t *__init_array_end [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_start [[gnu::visibility("protected")]];
-uintptr_t *__fini_array_end [[gnu::visibility("protected")]];
-}
-
-// Nvidia requires that the signature of the function pointers match. This means
-// we cannot support the extended constructor arguments.
-using InitCallback = void(void);
-using FiniCallback = void(void);
-
-static void call_init_array_callbacks(int, char **, char **) {
-  size_t init_array_size = __init_array_end - __init_array_start;
-  for (size_t i = 0; i < init_array_size; ++i)
-    reinterpret_cast<InitCallback *>(__init_array_start[i])();
-}
-
-static void call_fini_array_callbacks() {
-  size_t fini_array_size = __fini_array_end - __fini_array_start;
-  for (size_t i = fini_array_size; i > 0; --i)
-    reinterpret_cast<FiniCallback *>(__fini_array_start[i - 1])();
-}
-
 } // namespace LIBC_NAMESPACE_DECL
 
 extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
-_begin(int argc, char **argv, char **env) {
+_begin(int, char **, char **env) {
+  // The LLVM offloading runtime will automatically call any present global
+  // constructors and destructors so we defer that handling.
   __atomic_store_n(&LIBC_NAMESPACE::app.env_ptr,
                    reinterpret_cast<uintptr_t *>(env), __ATOMIC_RELAXED);
-
-  // We want the fini array callbacks to be run after other atexit
-  // callbacks are run. So, we register them before running the init
-  // array callbacks as they can potentially register their own atexit
-  // callbacks.
-  LIBC_NAMESPACE::atexit(&LIBC_NAMESPACE::call_fini_array_callbacks);
-  LIBC_NAMESPACE::call_init_array_callbacks(argc, argv, env);
 }
 
 extern "C" [[gnu::visibility("protected"), clang::nvptx_kernel]] void
diff --git a/llvm/tools/CMakeLists.txt b/llvm/tools/CMakeLists.txt
index 729797aa43f0b..b9c5a79849ec8 100644
--- a/llvm/tools/CMakeLists.txt
+++ b/llvm/tools/CMakeLists.txt
@@ -9,10 +9,6 @@
 # traversing each directory.
 create_llvm_tool_options()
 
-if(NOT LLVM_COMPILER_IS_GCC_COMPATIBLE OR NOT LLVM_LIBC_GPU_BUILD)
-  set(LLVM_TOOL_LLVM_GPU_LOADER_BUILD OFF)
-endif()
-
 if(NOT LLVM_BUILD_LLVM_DYLIB AND NOT LLVM_BUILD_LLVM_C_DYLIB)
   set(LLVM_TOOL_LLVM_SHLIB_BUILD Off)
 endif()
diff --git a/llvm/tools/llvm-gpu-loader/CMakeLists.txt b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
index b35a702476ada..de276635e2713 100644
--- a/llvm/tools/llvm-gpu-loader/CMakeLists.txt
+++ b/llvm/tools/llvm-gpu-loader/CMakeLists.txt
@@ -1,46 +1,12 @@
 set(LLVM_LINK_COMPONENTS
-  BinaryFormat
-  Object
   Option
   Support
-  FrontendOffloading
   TargetParser
 )
 
 add_llvm_tool(llvm-gpu-loader
   llvm-gpu-loader.cpp
 
-  # TODO: We intentionally split this currently due to statically linking the
-  #       GPU runtimes. Dynamically load the dependencies, possibly using the
-  #       LLVM offloading API when it is complete.
-  PARTIAL_SOURCES_INTENDED
-
   DEPENDS
   intrinsics_gen
 )
-
-# Locate the RPC server handling interface.
-include(FindLibcCommonUtils)
-target_link_libraries(llvm-gpu-loader PUBLIC llvm-libc-common-utilities)
-
-# Check for HSA support for targeting AMD GPUs.
-find_package(hsa-runtime64 QUIET 1.2.0 HINTS ${CMAKE_INSTALL_PREFIX} PATHS /opt/rocm)
-if(hsa-runtime64_FOUND)
-  target_sources(llvm-gpu-loader PRIVATE amdhsa.cpp)
-  target_compile_definitions(llvm-gpu-loader PRIVATE AMDHSA_SUPPORT)
-  target_link_libraries(llvm-gpu-loader PRIVATE hsa-runtime64::hsa-runtime64)
-
-  # Compatibility with the old amdhsa-loader name.
-  add_llvm_tool_symlink(amdhsa-loader llvm-gpu-loader)
-endif()
-
-# Check for CUDA support for targeting NVIDIA GPUs.
-find_package(CUDAToolkit 11.2 QUIET)
-if(CUDAToolkit_FOUND)
-  target_sources(llvm-gpu-loader PRIVATE nvptx.cpp)
-  target_compile_definitions(llvm-gpu-loader PRIVATE NVPTX_SUPPORT)
-  target_link_libraries(llvm-gpu-loader PRIVATE CUDA::cuda_driver)
-
-  # Compatibility with the old nvptx-loader name.
-  add_llvm_tool_symlink(nvptx-loader llvm-gpu-loader)
-endif()
diff --git a/llvm/tools/llvm-gpu-loader/amdhsa.cpp b/llvm/tools/llvm-gpu-loader/amdhsa.cpp
deleted file mode 100644
index 5715058d8cfac..0000000000000
--- a/llvm/tools/llvm-gpu-loader/amdhsa.cpp
+++ /dev/null
@@ -1,594 +0,0 @@
-//===-- Loader Implementation for AMDHSA devices --------------------------===//
-//
-// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
-// See https://llvm.org/LICENSE.txt for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-//===----------------------------------------------------------------------===//
-//
-// This file impelements a simple loader to run images supporting the AMDHSA
-// architecture. The file launches the '_start' kernel which should be provided
-// by the device application start code and call ultimately call the 'main'
-// function.
-//
-//===----------------------------------------------------------------------===//
-
-#include "llvm-gpu-loader.h"
-#include "server.h"
-
-#include "hsa/hsa.h"
-#include "hsa/hsa_ext_amd.h"
-
-#include "llvm/Frontend/Offloading/Utility.h"
-
-#include <atomic>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <thread>
-#include <tuple>
-#include <utility>
-
-// The implicit arguments of COV5 AMDGPU kernels.
-struct implicit_args_t {
-  uint32_t grid_size_x;
-  uint32_t grid_size_y;
-  uint32_t grid_size_z;
-  uint16_t workgroup_size_x;
-  uint16_t workgroup_size_y;
-  uint16_t workgroup_size_z;
-  uint8_t Unused0[46];
-  uint16_t grid_dims;
-  uint8_t Unused1[190];
-};
-
-/// Print the error code and exit if \p code indicates an error.
-static void handle_error_impl(const char *file, int32_t line,
-                              hsa_status_t code) {
-  if (code == HSA_STATUS_SUCCESS || code == HSA_STATUS_INFO_BREAK)
-    return;
-
-  const char *desc;
-  if (hsa_status_string(code, &desc) != HSA_STATUS_SUCCESS)
-    desc = "Unknown error";
-  fprintf(stderr, "%s:%d:0: Error: %s\n", file, line, desc);
-  exit(EXIT_FAILURE);
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename callback_ty>
-hsa_status_t iterate(func_ty func, callback_ty cb) {
-  auto l = [](elem_ty elem, void *data) -> hsa_status_t {
-    callback_ty *unwrapped = static_cast<callback_ty *>(data);
-    return (*unwrapped)(elem);
-  };
-  return func(l, static_cast<void *>(&cb));
-}
-
-/// Generic interface for iterating using the HSA callbacks.
-template <typename elem_ty, typename func_ty, typename func_arg_ty,
-          typename callback_ty>
-hsa_status_t iterate(func_ty func, func_arg_ty func_arg, callback_ty cb) {
-  auto l = [](elem_ty elem, void *data) -> hsa_status_t {
-    callback_ty *unwrapped = static_cast<callback_ty *>(data);
-    return (*unwrapped)(elem);
-  };
-  return func(func_arg, l, static_cast<void *>(&cb));
-}
-
-/// Iterate through all availible agents.
-template <typename callback_ty>
-hsa_status_t iterate_agents(callback_ty callback) {
-  return iterate<hsa_agent_t>(hsa_iterate_agents, callback);
-}
-
-/// Iterate through all availible memory pools.
-template <typename callback_ty>
-hsa_status_t iterate_agent_memory_pools(hsa_agent_t agent, callback_ty cb) {
-  return iterate<hsa_amd_memory_pool_t>(hsa_amd_agent_iterate_memory_pools,
-                                        agent, cb);
-}
-
-template <hsa_device_type_t flag>
-hsa_status_t get_agent(hsa_agent_t *output_agent) {
-  // Find the first agent with a matching device type.
-  auto cb = [&](hsa_agent_t hsa_agent) -> hsa_status_t {
-    hsa_device_type_t type;
-    hsa_status_t status =
-        hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_DEVICE, &type);
-    if (status != HSA_STATUS_SUCCESS)
-      return status;
-
-    if (type == flag) {
-      // Ensure that a GPU agent supports kernel dispatch packets.
-      if (type == HSA_DEVICE_TYPE_GPU) {
-        hsa_agent_feature_t features;
-        status =
-            hsa_agent_get_info(hsa_agent, HSA_AGENT_INFO_FEATURE, &features);
-        if (status != HSA_STATUS_SUCCESS)
-          return status;
-        if (features & HSA_AGENT_FEATURE_KERNEL_DISPATCH)
-          *output_agent = hsa_agent;
-      } else {
-        *output_agent = hsa_agent;
-      }
-      return HSA_STATUS_INFO_BREAK;
-    }
-    return HSA_STATUS_SUCCESS;
-  };
-
-  return iterate_agents(cb);
-}
-
-void print_kernel_resources(const char *kernel_name) {
-  fprintf(stderr, "Kernel resources on AMDGPU is not supported yet.\n");
-}
-
-/// Retrieve a global memory pool with a \p flag from the agent.
-template <hsa_amd_memory_pool_global_flag_t flag>
-hsa_status_t get_agent_memory_pool(hsa_agent_t agent,
-                                   hsa_amd_memory_pool_t *output_pool) {
-  auto cb = [&](hsa_amd_memory_pool_t memory_pool) {
-    uint32_t flags;
-    hsa_amd_segment_t segment;
-    if (auto err = hsa_amd_memory_pool_get_info(
-            memory_pool, HSA_AMD_MEMORY_POOL_INFO_SEGMENT, &segment))
-      return err;
-    if (auto err = hsa_amd_memory_pool_get_info(
-            memory_pool, HSA_AMD_MEMORY_POOL_INFO_GLOBAL_FLAGS, &flags))
-      return err;
-
-    if (segment != HSA_AMD_SEGMENT_GLOBAL)
-      return HSA_STATUS_SUCCESS;
-
-    if (flags & flag)
-      *output_pool = memory_pool;
-
-    return HSA_STATUS_SUCCESS;
-  };
-  return iterate_agent_memory_pools(agent, cb);
-}
-
-template <typename args_t>
-hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
-                           hsa_amd_memory_pool_t kernargs_pool,
-                           hsa_amd_memory_pool_t coarsegrained_pool,
-                           hsa_queue_t *queue, rpc::Server &server,
-                           const LaunchParameters &params,
-                           const char *kernel_name, args_t kernel_args,
-                           uint32_t wavefront_size, bool print_resource_usage) {
-  // Look up the kernel in the loaded executable.
-  hsa_executable_symbol_t symbol;
-  if (hsa_status_t err = hsa_executable_get_symbol_by_name(
-          executable, kernel_name, &dev_agent, &symbol))
-    return err;
-
-  // Retrieve different properties of the kernel symbol used for launch.
-  uint64_t kernel;
-  uint32_t args_size;
-  uint32_t group_size;
-  uint32_t private_size;
-  bool dynamic_stack;
-
-  std::pair<hsa_executable_symbol_info_t, void *> symbol_infos[] = {
-      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_OBJECT, &kernel},
-      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_KERNARG_SEGMENT_SIZE, &args_size},
-      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_GROUP_SEGMENT_SIZE, &group_size},
-      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_DYNAMIC_CALLSTACK, &dynamic_stack},
-      {HSA_EXECUTABLE_SYMBOL_INFO_KERNEL_PRIVATE_SEGMENT_SIZE, &private_size}};
-
-  for (auto &[info, value] : symbol_infos)
-    if (hsa_status_t err = hsa_executable_symbol_get_info(symbol, info, value))
-      return err;
-
-  // Allocate space for the kernel arguments on the host and allow the GPU agent
-  // to access it.
-  void *args;
-  if (hsa_status_t err = hsa_amd_memory_pool_allocate(kernargs_pool, args_size,
-                                                      /*flags=*/0, &args))
-    handle_error(err);
-  hsa_amd_agents_allow_access(1, &dev_agent, nullptr, args);
-
-  // Initialize all the arguments (explicit and implicit) to zero, then set the
-  // explicit arguments to the values created above.
-  std::memset(args, 0, args_size);
-  std::memcpy(args, &kernel_args, std::is_empty_v<args_t> ? 0 : sizeof(args_t));
-
-  // Initialize the necessary implicit arguments to the proper values.
-  int dims = 1 + (params.num_blocks_y * params.num_threads_y != 1) +
-             (params.num_blocks_z * params.num_threads_z != 1);
-  implicit_args_t *implicit_args = reinterpret_cast<implicit_args_t *>(
-      reinterpret_cast<uint8_t *>(args) + sizeof(args_t));
-  implicit_args->grid_dims = dims;
-  implicit_args->grid_size_x = params.num_blocks_x;
-  implicit_args->grid_size_y = params.num_blocks_y;
-  implicit_args->grid_size_z = params.num_blocks_z;
-  implicit_args->workgroup_size_x = params.num_threads_x;
-  implicit_args->workgroup_size_y = params.num_threads_y;
-  implicit_args->workgroup_size_z = params.num_threads_z;
-
-  // Obtain a packet from the queue.
-  uint64_t packet_id = hsa_queue_add_write_index_relaxed(queue, 1);
-  while (packet_id - hsa_queue_load_read_index_scacquire(queue) >= queue->size)
-    ;
-
-  const uint32_t mask = queue->size - 1;
-  hsa_kernel_dispatch_packet_t *packet =
-      static_cast<hsa_kernel_dispatch_packet_t *>(queue->base_address) +
-      (packet_id & mask);
-
-  // Set up the packet for exeuction on the device. We currently only launch
-  // with one thread on the device, forcing the rest of the wavefront to be
-  // masked off.
-  uint16_t setup = (dims) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
-  packet->workgroup_size_x = params.num_threads_x;
-  packet->workgroup_size_y = params.num_threads_y;
-  packet->workgroup_size_z = params.num_threads_z;
-  packet->reserved0 = 0;
-  packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
-  packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
-  packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
-  packet->private_segment_size =
-      dynamic_stack ? 16 * 1024 /* 16 KB */ : private_size;
-  packet->group_segment_size = group_size;
-  packet->kernel_object = kernel;
-  packet->kernarg_address = args;
-  packet->reserved2 = 0;
-  // Create a signal to indicate when this packet has been completed.
-  if (hsa_status_t err =
-          hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
-    handle_error(err);
-
-  if (print_resource_usage)
-    print_kernel_resources(kernel_name);
-
-  // Initialize the packet header and set the doorbell signal to begin execution
-  // by the HSA runtime.
-  uint16_t header =
-      1u << HSA_PACKET_HEADER_BARRIER |
-      (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
-      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
-      (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
-  uint32_t header_word = header | (setup << 16u);
-  __atomic_store_n((uint32_t *)&packet->header, header_word, __ATOMIC_RELEASE);
-  hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);
-
-  std::atomic<bool> finished = false;
-  std::thread server_thread(
-      [](std::atomic<bool> *finished, rpc::Server *server,
-         uint32_t wavefront_size, hsa_agent_t dev_agent,
-         hsa_amd_memory_pool_t coarsegrained_pool) {
-        // Register RPC callbacks for the malloc and free functions on HSA.
-        auto malloc_handler = [&](size_t size) -> void * {
-          void *dev_ptr = nullptr;
-          if (hsa_amd_memory_pool_allocate(coarsegrained_pool, size,
-                                           /*flags=*/0, &dev_ptr))
-            dev_ptr = nullptr;
-          hsa_amd_agents_allow_access(1, &dev_agent, nullptr, dev_ptr...
[truncated]

@jhuber6
Contributor Author

jhuber6 commented Oct 9, 2025

@Artem-B This may require you to change the build of the NVIDIA bot for libc. Ideally you just need to add offload to the runtimes list and make sure that libLLVMOffload.so and libcuda.so exist somewhere in the path.
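A sketch of what that bot configuration change might look like. The flag names are standard LLVM CMake options (`LLVM_LIBC_GPU_BUILD` also appears in the diff above), but the exact runtimes list the bot needs is an assumption:

```shell
# Hypothetical configure step: build the offload runtime so that
# libLLVMOffload.so is produced; libcuda.so comes from the NVIDIA driver
# install and only needs to be on the runtime search path.
cmake -G Ninja ../llvm \
  -DLLVM_ENABLE_PROJECTS="clang" \
  -DLLVM_ENABLE_RUNTIMES="offload;openmp" \
  -DLLVM_LIBC_GPU_BUILD=ON
```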

@Artem-B
Member

Artem-B commented Oct 9, 2025

This may require you to change the build of the NVIDIA bot for libc. Ideally you just need to add offload to the runtimes list and make sure that libLLVMOffload.so and libcuda.so exist somewhere in the path.

Are the libraries expected to exist somewhere in the search path on the build host or on the host where I run the tests?

@jhuber6
Contributor Author

jhuber6 commented Oct 9, 2025

Are the libraries expected to exist somewhere in the search path on the build host or on the host where I run the tests?

We pick them out at runtime via dlopen, so they only need to be somewhere on the runtime search path of the machine running the tests.

Member

@sarnex sarnex left a comment


not really qualified to review this but just some basic comments

return copyArgumentVector(Envc, Envp, Device);
}

ol_device_handle_t findDevice(MemoryBufferRef Binary) {
Member

maybe we could name this findDeviceForImage or something?

typedef enum ol_alloc_type_t {
OL_ALLOC_TYPE_HOST = 0,
OL_ALLOC_TYPE_DEVICE = 1,
OL_ALLOC_TYPE_FORCE_UINT32 = 0x7fffffff
Member

can you describe what the FORCE entries represent?

Contributor Author

This is technically C++, so I could probably just use `: uint32_t`, but the large sentinel is the C way of making sure the enum is represented as a `u32` instead of a `u8` or something.

typedef enum ol_platform_backend_t {
OL_PLATFORM_BACKEND_UNKNOWN = 0,
OL_PLATFORM_BACKEND_CUDA = 1,
OL_PLATFORM_BACKEND_AMDGPU = 2,
Member

should we add one for LEVEL_ZERO even though the plugin isn't merged yet?

Contributor Author

No, that should happen when there's actual support for it.

typedef struct ol_symbol_impl_t *ol_symbol_handle_t;
typedef const struct ol_error_struct_t *ol_result_t;

typedef bool (*ol_device_iterate_cb_t)(ol_device_handle_t Device,
Member

@sarnex sarnex Oct 10, 2025

nit: can we use `using x = y;` instead of `typedef`? LLVM overall requires C++17, so I think it should be fine

Contributor Author

These are copied from the OffloadAPI header; I didn't think it was necessary to port them to C++17 since we might need to copy-paste them again in the future.


Successfully merging this pull request may close these issues.

[LLVM] tools/llvm-gpu-loader introduces silent automagic dependencies on HSA and CUDA runtimes
