@@ -210,6 +210,95 @@ Host Code Compilation
210210- These relocatable objects are then linked together.
211211- Host code within a TU can call host functions and launch kernels from another TU.
212212
213+ HIP Fat Binary Registration and Unregistration
214+ ==============================================
215+
216+ When compiling HIP for AMD GPUs, Clang embeds device code into HIP "fat
217+ binaries" and generates host-side helper functions that register these
218+ fat binaries with the HIP runtime at program start and unregister them at
219+ program exit. In non-RDC mode (``-fno-gpu-rdc``), each compilation unit
220+ typically produces its own HIP fat binary: a container that holds, for every
221+ enabled GPU architecture, a fully linked offloading device image (for example,
222+ a GPU code object) that can be loaded directly by the HIP runtime. In RDC mode
223+ (``-fgpu-rdc``), each compilation unit contributes device code in a relocatable
224+ form (for example, GPU object files or LLVM IR). A later device-link step links
225+ those relocatable inputs into fully linked device images per GPU architecture
226+ and then packages those images into a HIP fat binary container.
227+
228+ Registering a HIP fat binary allows the runtime to discover the kernels and
229+ device variables defined in that container and to associate host-side addresses
230+ and symbols with the corresponding GPU-side entities. For example, when a
231+ host-side kernel launch stub is called, the HIP runtime uses information
232+ established during registration (and the fat binary handle it returned) to
233+ identify which GPU kernel symbol to launch from which device image.
234+
235+ At the LLVM IR level, Clang/LLVM typically create an internal module
236+ constructor (for example ``__hip_module_ctor`` or a ``.hip.fatbin_reg``
237+ function) and add it to ``@llvm.global_ctors``. This constructor is called by
238+ the C runtime before ``main`` and it:
239+
240+ * calls ``__hipRegisterFatBinary`` with a pointer to an internal wrapper
241+ object that describes the HIP fat binary;
242+ * stores the returned handle in an internal global variable;
243+ * calls an internal helper such as ``__hip_register_globals`` to register
244+ kernels, device variables and other metadata associated with the fat binary;
245+ * registers a corresponding module destructor with ``atexit`` so it will run
246+ during program termination and use the stored handle to unregister the fat
247+ binary from the HIP runtime.
248+
249+ The module destructor (for example ``__hip_module_dtor`` or a
250+ ``.hip.fatbin_unreg`` function) loads the stored handle, checks that it is
251+ non-null, calls ``__hipUnregisterFatBinary`` to unregister the fat binary from
252+ the HIP runtime, and then clears the handle. This ensures that the HIP runtime
253+ sees each fat binary registered exactly once and that it is unregistered once
254+ at exit, even when multiple translation units contribute HIP kernels to the
255+ same host program.
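
The constructor and destructor steps described above can be sketched in plain
C++. This is a minimal illustration only, not the code Clang emits: the
``mock_register_fat_binary`` and ``mock_unregister_fat_binary`` functions, their
signatures, and the ``module_ctor``/``module_dtor`` names are hypothetical
stand-ins for the runtime entry points and the generated helpers. In generated
code the constructor is listed in ``@llvm.global_ctors``; here it would be
called explicitly.

```cpp
#include <cassert>
#include <cstdlib>

// Mock runtime state (assumption: stands in for the HIP runtime's bookkeeping).
int g_live_registrations = 0;
// Stand-in for the compiler-generated wrapper object describing the fat binary.
int g_wrapper = 0;

// Hypothetical stand-in for the runtime's registration entry point.
void **mock_register_fat_binary(const void *wrapper) {
  assert(wrapper != nullptr);
  static void *slot = nullptr;  // opaque storage the handle points at
  ++g_live_registrations;
  return &slot;
}

// Hypothetical stand-in for the runtime's unregistration entry point.
void mock_unregister_fat_binary(void **handle) {
  assert(handle != nullptr);
  --g_live_registrations;
}

// Internal global that holds the handle returned by registration.
void **g_handle = nullptr;

// Module destructor: load the handle, check for null, unregister, clear.
void module_dtor() {
  if (g_handle != nullptr) {
    mock_unregister_fat_binary(g_handle);
    g_handle = nullptr;  // a second call becomes a no-op
  }
}

// Module constructor: register, store the handle, schedule the destructor.
void module_ctor() {
  g_handle = mock_register_fat_binary(&g_wrapper);
  // Kernels and device variables would be registered here using g_handle.
  std::atexit(module_dtor);
}
```

Because the destructor clears the stored handle, calling it again (for example
once by hand and once through ``atexit``) does not unregister twice.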
256+
257+ These registration/unregistration helpers are implementation details of Clang's
258+ HIP code generation; user code should not call ``__hipRegisterFatBinary`` or
259+ ``__hipUnregisterFatBinary`` directly.
260+
261+ Implications for HIP Application Developers
262+ -------------------------------------------
263+
264+ From the point of view of HIP application code, Clang and the HIP runtime
265+ provide the following guarantees:
266+
267+ * Kernels and device variables defined in HIP code will be registered with the
268+ HIP runtime before ``main`` begins execution.
269+ * Fat binaries will be unregistered via an ``atexit``-registered module
270+ destructor after ``main`` returns (or after ``exit`` is called).
271+
272+ Beyond these points, the detailed ordering of fat binary registration and
273+ unregistration relative to user-defined global constructors, destructors and
274+ other ``atexit`` handlers is not specified and should not be relied upon.
275+ Applications should avoid depending on HIP kernels or device variables being
276+ usable from global constructors or destructors, and instead perform HIP
277+ initialization and teardown that touches device state in ``main`` (or in
278+ functions called from ``main``).
279+
280+ Implications for HIP Runtime Developers
281+ ---------------------------------------
282+
283+ HIP runtime implementations that are linked with Clang-generated host code
284+ must handle registration and unregistration even though the ordering of
285+ global constructors and destructors is unspecified:
286+
287+ * ``__hipRegisterFatBinary`` must accept a pointer to the compiler-generated
288+ wrapper object and return an opaque handle that remains valid for as long as
289+ the fat binary may be used.
290+ * ``__hipUnregisterFatBinary`` must accept the handle previously returned by
291+ ``__hipRegisterFatBinary`` and perform any necessary cleanup. It may be
292+ called late in process teardown, after other parts of the runtime have
293+ started shutting down, so it should be robust in the presence of partially
294+ torn-down state.
295+ * Runtimes should use appropriate synchronization and guards so that fat
296+ binary registration does not observe uninitialized resources and
297+ unregistration does not release resources that are still required by other
298+ runtime components. In particular, registration and unregistration routines
299+ should be written to be safe under repeated calls and in the presence of
300+ concurrent or overlapping initialization/teardown logic.
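
One way a runtime might meet the requirements listed above is sketched below.
It is an assumption-laden illustration, not how any particular HIP runtime is
implemented: the ``FatBinaryRegistry`` class, its methods, and the ``registry``
accessor are hypothetical names. The sketch uses a mutex for synchronization, a
lazily constructed (and intentionally leaked) registry so that early
registration and late unregistration never see a destroyed object, and
idempotent register/unregister operations.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>

// Hypothetical registry of fat binaries, keyed by wrapper address.
class FatBinaryRegistry {
  struct Entry {
    void *slot = nullptr;  // opaque storage the returned handle points at
    bool live = false;     // true while the fat binary is registered
  };

public:
  // Registering the same wrapper again returns the same handle.
  void **register_fat_binary(const void *wrapper) {
    assert(wrapper != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    std::unique_ptr<Entry> &entry = entries_[wrapper];
    if (!entry)
      entry = std::make_unique<Entry>();
    entry->live = true;
    return &entry->slot;
  }

  // Returns true only if a live registration was released; null, unknown,
  // or already-unregistered handles are ignored.
  bool unregister_fat_binary(void **handle) {
    if (handle == nullptr)
      return false;
    std::lock_guard<std::mutex> lock(mutex_);
    for (auto &kv : entries_) {
      if (&kv.second->slot == handle && kv.second->live) {
        kv.second->live = false;
        return true;
      }
    }
    return false;
  }

  // Number of fat binaries currently registered.
  std::size_t live_count() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t n = 0;
    for (auto &kv : entries_)
      if (kv.second->live)
        ++n;
    return n;
  }

private:
  std::mutex mutex_;
  std::unordered_map<const void *, std::unique_ptr<Entry>> entries_;
};

// Constructed on first use and never destroyed, so it is usable from global
// constructors and still valid when atexit handlers run.
FatBinaryRegistry &registry() {
  static FatBinaryRegistry *instance = new FatBinaryRegistry();
  return *instance;
}
```

Leaking the registry trades a small amount of memory at exit for safety: no
static destructor can tear it down before a late unregistration call arrives.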
301+
213302Syntax Difference with CUDA
214303===========================
215304