Commit 1b8626b

Improve HIP docs on fat binary registration ordering (#168566)
Clarify how Clang-generated HIP fat binaries are registered and unregistered with the HIP runtime, and how this interacts with global constructors, destructors, and atexit handlers. Document that there is no strong guarantee on ordering relative to user-defined global ctors/dtors, recommend that HIP application developers avoid using kernels or device variables from global ctors/dtors, and describe the implications for HIP runtime developers (synchronization and guards in __hipRegisterFatBinary/__hipUnregisterFatBinary). This is motivated by questions from HIP application and runtime developers about fat binary registration/unregistration order and its potential interference with their own initialization and teardown code.
1 parent e23328b commit 1b8626b

1 file changed: +89 -0 lines changed

clang/docs/HIPSupport.rst

Lines changed: 89 additions & 0 deletions
@@ -210,6 +210,95 @@ Host Code Compilation
 - These relocatable objects are then linked together.
 - Host code within a TU can call host functions and launch kernels from another TU.

+HIP Fat Binary Registration and Unregistration
+==============================================
+
+When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
+binaries" and generates host-side helper functions that register these
+fat binaries with the HIP runtime at program start and unregister them at
+program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
+typically produces its own HIP fat binary: a container that holds, for every
+enabled GPU architecture, a fully linked offloading device image (for example,
+a GPU code object) that can be loaded directly by the HIP runtime. In RDC mode
+(``-fgpu-rdc``), each compilation unit contributes device code in a relocatable
+form (for example, GPU object files or LLVM IR). A later device-link step links
+those relocatable inputs into fully linked device images per GPU architecture
+and then packages those images into a HIP fat binary container.
+
+Registering a HIP fat binary allows the runtime to discover the kernels and
+device variables defined in that container and to associate host-side addresses
+and symbols with the corresponding GPU-side entities. For example, when a
+host-side kernel launch stub is called, the HIP runtime uses information
+established during registration (and the fat binary handle it returned) to
+identify which GPU kernel symbol to launch from which device image.
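
The association described above can be pictured as a lookup table keyed by host-side address. The following is a minimal C++ sketch, not the HIP runtime's actual data structure: `FatBinaryHandle`, `KernelEntry`, and `KernelTable` are hypothetical names introduced only for illustration, and a real runtime tracks considerably more (device variables, per-device images, and so on).

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the opaque handle returned by registration.
struct FatBinaryHandle {
  int id;
};

// What the runtime needs to remember about one registered kernel.
struct KernelEntry {
  const FatBinaryHandle *handle; // fat binary that contains the kernel
  std::string deviceSymbol;      // name of the GPU-side kernel symbol
};

// Hypothetical table mapping a host-side stub address to its kernel entry.
class KernelTable {
public:
  // Called during registration, once per kernel in a fat binary.
  void registerKernel(const FatBinaryHandle *handle, const void *hostStub,
                      std::string deviceSymbol) {
    assert(handle != nullptr && hostStub != nullptr);
    table_[hostStub] = KernelEntry{handle, std::move(deviceSymbol)};
  }

  // Called at launch time. Returns nullptr if the stub was never registered.
  const KernelEntry *lookup(const void *hostStub) const {
    auto it = table_.find(hostStub);
    return it == table_.end() ? nullptr : &it->second;
  }

private:
  std::unordered_map<const void *, KernelEntry> table_;
};
```

A launch through an unregistered stub address finds no entry, which is why the registration step has to complete before the first kernel launch.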
+
+At the LLVM IR level, Clang/LLVM typically create an internal module
+constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
+function) and add it to ``@llvm.global_ctors``. This constructor is called by
+the C runtime before ``main`` and it:
+
+* calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
+  object that describes the HIP fat binary;
+* stores the returned handle in an internal global variable;
+* calls an internal helper such as ``__hip_register_globals`` to register
+  kernels, device variables and other metadata associated with the fat binary;
+* registers a corresponding module destructor with ``atexit`` so it will run
+  during program termination and use the stored handle to unregister the fat
+  binary from the HIP runtime.
+
+The module destructor (for example ``__hip_module_dtor`` or a
+``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
+non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
+the HIP runtime, and then clears the handle. This ensures that the HIP runtime
+sees each fat binary registered exactly once and that it is unregistered once
+at exit, even when multiple translation units contribute HIP kernels to the
+same host program.
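
The constructor and destructor logic described above can be approximated in plain C++. This is a sketch under stated assumptions, not the code Clang emits: `mockRegisterFatBinary` and `mockUnregisterFatBinary` are hypothetical stand-ins for the runtime entry points, `moduleCtor` and `moduleDtor` are ordinary functions rather than entries in ``@llvm.global_ctors``, and the step that registers kernels and device variables is reduced to a comment.

```cpp
#include <cassert>
#include <cstdlib>

// Number of fat binaries the mock runtime currently considers registered.
static int g_registered = 0;

// Hypothetical mock of the runtime's registration entry point.
void **mockRegisterFatBinary(const void *wrapper) {
  assert(wrapper != nullptr);
  (void)wrapper;
  ++g_registered;
  static void *slot = nullptr; // storage behind the opaque handle
  return &slot;
}

// Hypothetical mock of the runtime's unregistration entry point.
void mockUnregisterFatBinary(void **handle) {
  assert(handle != nullptr);
  (void)handle;
  --g_registered;
}

static const int g_fatbinWrapper = 0; // stand-in for the wrapper object
static void **g_handle = nullptr;     // internal global holding the handle

// Module destructor: load the handle, check it, unregister, then clear it.
void moduleDtor() {
  if (g_handle != nullptr) {
    mockUnregisterFatBinary(g_handle);
    g_handle = nullptr; // a second call becomes a no-op
  }
}

// Module constructor: register, store the handle, schedule the destructor.
void moduleCtor() {
  g_handle = mockRegisterFatBinary(&g_fatbinWrapper);
  // Generated code would register kernels and device variables here.
  std::atexit(moduleDtor);
}
```

Because the destructor clears the stored handle, running it once explicitly and once more through `atexit` still unregisters the fat binary only once.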
+
+These registration/unregistration helpers are implementation details of Clang's
+HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
+``__hipUnregisterFatBinary`` directly.
+
+Implications for HIP Application Developers
+-------------------------------------------
+
+From the point of view of HIP application code, Clang and the HIP runtime
+provide the following guarantees:
+
+* Kernels and device variables defined in HIP code will be registered with the
+  HIP runtime before ``main`` begins execution.
+* Fat binaries will be unregistered via an ``atexit``-registered module
+  destructor after ``main`` returns (or after ``exit`` is called).
+
+Beyond these points, the detailed ordering of fat binary registration and
+unregistration relative to user-defined global constructors, destructors and
+other ``atexit`` handlers is not specified and should not be relied upon.
+Applications should avoid depending on HIP kernels or device variables being
+usable from global constructors or destructors, and instead perform HIP
+initialization and teardown that touches device state in ``main`` (or in
+functions called from ``main``).
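
One way to follow this recommendation is to keep global objects free of device work and give them explicit `init` and `shutdown` steps that `main` calls. The sketch below is plain C++ with simulated device work so that it builds without a HIP toolchain. `g_deviceCodeRegistered`, `launchScaleKernel`, `Scaler`, and `runApp` are all hypothetical names, and the flag only models the runtime's registration state, which real applications cannot observe directly.

```cpp
#include <cassert>
#include <stdexcept>

// Simulated "fat binary has been registered" state (an assumption for this
// sketch; in a real program this lives inside the HIP runtime).
static bool g_deviceCodeRegistered = false;

// Stand-in for work that launches a kernel or touches a device variable.
int launchScaleKernel(int x) {
  if (!g_deviceCodeRegistered)
    throw std::runtime_error("device code not registered yet");
  return 2 * x;
}

class Scaler {
public:
  Scaler() = default; // no device work here, so a global instance is safe
  ~Scaler() = default; // no device work here either

  // Device-touching setup, meant to be called from main.
  void init() {
    result_ = launchScaleKernel(21);
    ready_ = true;
  }

  // Device-touching teardown, meant to be called before main returns.
  void shutdown() { ready_ = false; }

  bool ready() const { return ready_; }
  int result() const { return result_; }

private:
  bool ready_ = false;
  int result_ = 0;
};

Scaler g_scaler; // constructed before main, but does nothing device-related

// Body of main: all device-related setup and teardown happens here.
int runApp() {
  g_scaler.init();
  int r = g_scaler.result();
  g_scaler.shutdown();
  return r;
}
```

Moving the call to `launchScaleKernel` into the `Scaler` constructor would make the program depend on the unspecified ordering between registration and user-defined global constructors.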
+
+Implications for HIP Runtime Developers
+---------------------------------------
+
+HIP runtime implementations that are linked with Clang-generated host code
+must handle registration and unregistration in the presence of uncertain
+global ctor/dtor ordering:
+
+* ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
+  wrapper object and return an opaque handle that remains valid for as long as
+  the fat binary may be used.
+* ``__hipUnregisterFatBinary`` must accept the handle previously returned by
+  ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
+  called late in process teardown, after other parts of the runtime have
+  started shutting down, so it should be robust in the presence of partially
+  torn-down state.
+* Runtimes should use appropriate synchronization and guards so that fat
+  binary registration does not observe uninitialized resources and
+  unregistration does not release resources that are still required by other
+  runtime components. In particular, registration and unregistration routines
+  should be written to be safe under repeated calls and in the presence of
+  concurrent or overlapping initialization/teardown logic.
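
The guards listed above could look roughly like the following C++ sketch. It is an illustration under assumptions rather than how any existing HIP runtime is implemented: `FatBinaryRegistry` and its members are hypothetical, the wrapper pointer doubles as the opaque handle for simplicity, and "device cleanup" is modeled by a counter.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical registry showing lazy initialization, locking, and
// idempotent register/unregister operations.
class FatBinaryRegistry {
public:
  // Created on first use and intentionally never destroyed, so calls made
  // from global constructors or late atexit handlers always find it alive.
  static FatBinaryRegistry &instance() {
    static FatBinaryRegistry *registry = new FatBinaryRegistry;
    return *registry;
  }

  // Registering the same wrapper twice returns the same handle.
  const void *registerFatBinary(const void *wrapper) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (wrapper == nullptr)
      return nullptr;
    live_.insert(wrapper);
    return wrapper;
  }

  // Returns true only if the handle was live. Unknown or repeated handles
  // are ignored instead of being treated as errors.
  bool unregisterFatBinary(const void *handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    if (live_.erase(handle) == 0)
      return false;
    if (!shuttingDown_)
      ++deviceUnloads_; // touch device-side state only while it still exists
    return true;
  }

  // Called when other parts of the runtime start tearing down.
  void beginShutdown() {
    std::lock_guard<std::mutex> lock(mutex_);
    shuttingDown_ = true;
  }

  std::size_t liveCount() {
    std::lock_guard<std::mutex> lock(mutex_);
    return live_.size();
  }

  int deviceUnloads() {
    std::lock_guard<std::mutex> lock(mutex_);
    return deviceUnloads_;
  }

private:
  FatBinaryRegistry() = default;

  std::mutex mutex_;
  std::unordered_set<const void *> live_;
  bool shuttingDown_ = false;
  int deviceUnloads_ = 0;
};
```

Leaking the singleton trades a small amount of memory at exit for the assurance that unregistration calls arriving after other static destructors have run never touch a destroyed object.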
+
 Syntax Difference with CUDA
 ===========================
