diff --git a/.wordlist.txt b/.wordlist.txt
index 32d489abc8..6cbf374ae1 100644
--- a/.wordlist.txt
+++ b/.wordlist.txt
@@ -7,11 +7,15 @@ APUs
AQL
AXPY
asm
+Asynchronicity
+Asynchrony
asynchrony
backtrace
+bfloat
Bitcode
bitcode
bitcodes
+bitmask
blockDim
blockIdx
builtins
@@ -24,7 +28,8 @@ coroutines
Ctx
cuBLASLt
cuCtx
-CUDA's
+CUDA
+cuda
cuDNN
cuModule
dataflow
@@ -32,10 +37,10 @@ deallocate
decompositions
denormal
Dereferencing
+DFT
dll
DirectX
EIGEN
-EIGEN's
enqueue
enqueues
entrypoint
@@ -61,7 +66,6 @@ hardcoded
HC
hcBLAS
HIP-Clang
-HIP's
hipcc
hipCtx
hipexamine
@@ -71,6 +75,7 @@ hipModule
hipModuleLaunchKernel
hipother
HIPRTC
+hyperthreading
icc
IILE
iGPU
@@ -91,7 +96,6 @@ iteratively
Lapack
latencies
libc
-libhipcxx
libstdc
lifecycle
linearizing
@@ -116,6 +120,7 @@ NDRange
nonnegative
NOP
Numa
+ns
Nsight
ocp
omnitrace
@@ -124,6 +129,7 @@ overindexing
oversubscription
overutilized
parallelizable
+pipelining
parallelized
pixelated
pragmas
@@ -142,7 +148,6 @@ quad
representable
RMW
rocgdb
-ROCm's
rocTX
roundtrip
rst
@@ -155,10 +160,10 @@ sceneries
shaders
SIMT
sinewave
+sinf
SOMA
SPMV
structs
-struct's
SYCL
syntaxes
texel
@@ -169,8 +174,11 @@ templated
toolkits
transfering
typedefs
+ULP
+ULPs
unintuitive
UMM
+uncoalesced
unmap
unmapped
unmapping
diff --git a/README.md b/README.md
index 4df0b4c6a9..57ff69619b 100644
--- a/README.md
+++ b/README.md
@@ -36,33 +36,6 @@ HIP releases are typically naming convention for each ROCM release to help diffe
* rocm x.yy: These are the stable releases based on the ROCM release.
This type of release is typically made once a month.*
-## More Info
-
-* [Installation](docs/install/install.rst)
-* [HIP FAQ](docs/faq.rst)
-* [HIP C++ Language Extensions](docs/reference/cpp_language_extensions.rst)
-* [HIP Porting Guide](docs/how-to/hip_porting_guide.md)
-* [HIP Porting Driver Guide](docs/how-to/hip_porting_driver_api.rst)
-* [HIP Programming Guide](docs/programming_guide.rst)
-* [HIP Logging](docs/how-to/logging.rst)
-* [Building HIP From Source](docs/install/build.rst)
-* [HIP Debugging](docs/how-to/debugging.rst)
-* [HIP RTC](docs/how-to/hip_rtc.md)
-* [HIP Terminology](docs/reference/terms.md) (including Rosetta Stone of GPU computing terms across CUDA/HIP/OpenCL)
-* [HIPIFY](https://github.com/ROCm/HIPIFY/blob/amd-staging/README.md)
-* Supported CUDA APIs:
- * [Runtime API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUDA_Runtime_API_functions_supported_by_HIP.md)
- * [Driver API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUDA_Driver_API_functions_supported_by_HIP.md)
- * [cuComplex API](https://github.com/ROCm/HIPIFY/blob/amd-staging/reference/docs/tables/cuComplex_API_supported_by_HIP.md)
- * [Device API](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUDA_Device_API_supported_by_HIP.md)
- * [cuBLAS](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUBLAS_API_supported_by_ROC.md)
- * [cuRAND](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CURAND_API_supported_by_HIP.md)
- * [cuDNN](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUDNN_API_supported_by_HIP.md)
- * [cuFFT](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUFFT_API_supported_by_HIP.md)
- * [cuSPARSE](https://github.com/ROCm/HIPIFY/blob/amd-staging/docs/reference/tables/CUSPARSE_API_supported_by_HIP.md)
-* [Developer/CONTRIBUTING Info](CONTRIBUTING.md)
-* [Release Notes](RELEASE.md)
-
## How do I get set up?
See the [Installation](docs/install/install.rst) notes.
@@ -91,7 +64,7 @@ hipMemcpy(C_h, C_d, Nbytes, hipMemcpyDeviceToHost);
The HIP kernel language defines builtins for determining grid and block coordinates, math functions, short vectors,
atomics, and timer functions.
-It also specifies additional defines and keywords for function types, address spaces, and optimization controls (See the [HIP C++ Language Extensions](docs/reference/cpp_language_extensions.rst) for a full description).
+It also specifies additional defines and keywords for function types, address spaces, and optimization controls (See the [HIP C++ Language Extensions](docs/how-to/hip_cpp_language_extensions.rst) for a full description).
Here's an example of defining a simple 'vector_square' kernel.
```cpp
diff --git a/RELEASE.md b/RELEASE.md
deleted file mode 100644
index 15fb221549..0000000000
--- a/RELEASE.md
+++ /dev/null
@@ -1,216 +0,0 @@
-# Release notes
-
-We have attempted to document known bugs and limitations - in particular the [HIP Kernel Language](docs/markdown/hip_kernel_language.md) document uses the phrase "Under Development", and the [HIP Runtime API issue list](https://github.com/ROCm/HIP/issues) lists known bugs.
-
-
-===================================================================================================
-
-
-## Revision History:
-
-===================================================================================================
-Release: 1.5
-Date:
-- Support threadIdx, blockIdx, blockDim directly (no need for hipify conversions in kernels.) HIP
- Kernel syntax is now identical to CUDA kernel syntax - no need for extra parms or conversions.
-- Refactor launch syntax. HIP now extracts kernels from the executable and launches them using the
- existing module interface. Kernels dispatch no longer flows through HCC. Result is faster
- kernel launches and with less resource usage (no signals required).
-- Remove requirement for manual "serializers" previously required when passing complex structures
- into kernels.
-- Remove need for manual destructors
-- Provide printf in device code
-- Support for globals when using module API
-- hipify-clang now supports using newer versions of clang
-- HIP texture support equivalent to CUDA texture driver APIs
-- Updates to hipify-perl, hipify-clang and documentation
-
-
-===================================================================================================
-Release: 1.4
-Date: 2017.10.06
-- Improvements to HIP event management
-- Added new HIP_TRACE_API options
-- Enabled device side assert support
-- Several bug fixes including hipMallocArray, hipTexture fetch
-- Support for RHEL/CentOS 7.4
-- Updates to hipify-perl, hipify-clang and documentation
-
-
-===================================================================================================
-Release: 1.3
-Date: 2017.08.16
-- hipcc now auto-detects amdgcn arch. No need to specify the arch when building for same system.
-- HIP texture support (run-time APIs)
-- Implemented __threadfence_support
-- Improvements in HIP context management logic
-- Bug fixes in several APIs including hipDeviceGetPCIBusId, hipEventDestroy, hipMemcpy2DAsync
-- Updates to hipify-clang and documentation
-- HIP development now fully open and on GitHub. Developers should submit pull requests.
-
-
-===================================================================================================
-Release: 1.2
-Date: 2017.06.29
-- new APIs: hipMemcpy2DAsync, hipMallocPitch, hipHostMallocCoherent, hipHostMallocNonCoherent
-- added support for building hipify-clang using clang 3.9
-- hipify-clang updates for CUDA 8.0 runtime+driver support
-- renamed hipify to hipify-perl
-- initial implementation of hipify-cmakefile
-- several documentation updates & bug fixes
-- support for abort() function in device code
-
-
-===================================================================================================
-Release: 1.0.17102
-Date: 2017.03.07
-- Lots of improvements to hipify-clang.
-- Added HIP package config for cmake.
-- Several bug fixes and documentation updates.
-
-
-===================================================================================================
-Release: 1.0.17066
-Date: 2017.02.11
-- Improved support for math device functions.
-- Added several half math device functions.
-- Enabled support for CUDA 8.0 in hipify-clang.
-- Lots of bug fixes and documentation updates.
-
-
-===================================================================================================
-Release: 1.0.17015
-Date: 2017.01.06
-- Several improvements to the hipify-clang infrastructure.
-- Refactored module and function APIs.
-- HIP now defaults to linking against the shared runtime library.
-- Documentation updates.
-
-
-===================================================================================================
-Release: 1.0.16502
-Date: 2016.12.13
-- Added several fast math and packaged math instrincs
-- Improved debug and profiler documentation
-- Support for building and linking to HIP shared library
-- Several improvements to hipify-clang
-- Several bug fixes
-
-
-===================================================================================================
-Release: 1.0.16461
-Date: 2016.11.14
-- Significant changes to the HIP Profiling APIs. Refer to the documentation for details
-- Improvements to P2P support
-- New API: hipDeviceGetByPCIBusId
-- Several bug fixes in NV path
-- hipModuleLaunch now works for multi-dim kernels
-
-
-===================================================================================================
-Release:1.0
-Date: 2016.11.8
-- Initial implementation for FindHIP.cmake
-- HIP library now installs as a static library by default
-- Added support for HIP context and HIP module APIs
-- Major changes to HIP signal & memory management implementation
-- Support for complex data type and math functions
-- clang-hipify is now known as hipify-clang
-- Added several new HIP samples
-- Preliminary support for new APIs: hipMemcpyToSymbol, hipDeviceGetLimit, hipRuntimeGetVersion
-- Added support for async memcpy driver API (for example hipMemcpyHtoDAsync)
-- Support for memory management device functions: malloc, free, memcpy & memset
-- Removed deprecated HIP runtime header locations. Please include "hip/hip_runtime.h" instead of "hip_runtime.h". You can use `find . -type f -exec sed -i 's:#include "hip_runtime.h":#include "hip/hip_runtime.h":g' {} +` to replace all such references
-
-
-===================================================================================================
-Release:0.92.00
-Date: 2016.8.14
-- hipLaunchKernel supports one-dimensional grid and/or block dims, without explicit cast to dim3 type (actually in 0.90.00)
-- fp16 software support
-- Support for Hawaii dGPUs using environment variable ROCM_TARGET=hawaii
-- Support hipArray
-- Improved profiler support
-- Documentation updates
-- Improvements to clang-hipify
-
-
-===================================================================================================
-Release:0.90.00
-Date: 2016.06.29
-- Support dynamic shared memory allocations
-- Min HCC compiler version is > 16186.
-- Expanded math functions (device and host). Document unsupported functions.
-- hipFree with null pointer initializes runtime and returns success.
-- Improve error code reporting on nvcc.
-- Add hipPeekAtError for nvcc.
-
-
-===================================================================================================
-Release:0.86.00
-Date: 2016.06.06
-- Add clang-hipify : clang-based hipify tool. Improved parsing of source code, and automates
- creation of hipLaunchParm variable.
-- Implement memory register / unregister commands (hipHostRegister, hipHostUnregister)
-- Add cross-linking support between G++ and HCC, in particular for interfaces that use
- standard C++ libraries (ie std::vectors, std::strings). HIPCC now uses libstdc++ by default on the HCC
- compilation path.
-- More samples including gpu-burn, SHOC, nbody, rtm. See [HIP-Examples](https://github.com/ROCm/HIP-Examples)
-
-
-===================================================================================================
-Release:0.84.01
-Date: 2016.04.25
-- Refactor HIP make and install system:
- - Move to CMake. Refer to the installation section in README.md for details.
- - Split source into multiple modular .cpp and .h files.
- - Create static library and link.
- - Set HIP_PATH to install.
-- Make hipDevice and hipStream thread-safe.
- - Preferred hipStream usage is still to create new streams for each new thread, but it works even if you don;t.
-- Improve automated platform detection: If AMD GPU is installed and detected by driver, default HIP_PLATFORM to hcc.
-- HIP_TRACE_API now prints arguments to the HIP function (in addition to name of function).
-- Deprecate hipDeviceGetProp (Replace with hipGetDeviceProp)
-- Deprecate hipMallocHost (Replace with hipHostMalloc)
-- Deprecate hipFreeHost (Replace with hipHostFree)
-- The mixbench benchmark tool for measuring operational intensity now has a HIP target, in addition to CUDA and OpenCL. Let the comparisons begin. :)
-See here for more : https://github.com/ekondis/mixbench.
-
-
-===================================================================================================
-Release:0.82.00
-Date: 2016.03.07
-- Bump minimum required HCC workweek to 16074.
-- Bump minimum required ROCK-Kernel-Driver and ROCR-Runtime to Developer Preview 2.
-- Enable multi-GPU support.
- * Use hipSetDevice to select a device for subsequent kernel calls and memory allocations.
- * CUDA_VISIBLE_DEVICES / HIP_VISIBLE_DEVICE environment variable selects devices visible to the runtime.
-- Support hipStreams – send sequences of copy and kernel commands to a device.
- * Asynchronous copies supported.
-- Optimize memory copy operations.
-- Support hipPointerGetAttribute – can determine if a pointer is host or device.
-- Enable atomics to local memory.
-- Support for LC Direct-To-ISA path.
-- Improved free memory reporting.
- * hipMemGetInfo (report full memory used in current process).
- * hipDeviceReset (deletes all memory allocated by current process).
-
-
-===================================================================================================
-Release:0.80.01
-Date: 2016.02.18
-- Improve reporting and support for device-side math functions.
-- Update Runtime Documentation.
-- Improve implementations of cross-lane operations (_ballot, _any, _all).
-- Provide shuffle intrinsics (performance optimization in-progress).
-- Support hipDeviceAttribute for querying "one-shot" device attributes, as an alternative to hipGetDeviceProperties.
-
-
-===================================================================================================
-Release:0.80.00
-Date: 2016.01.25
-
-Initial release with GPUOpen Launch.
-
-
-
diff --git a/docs/conf.py b/docs/conf.py
index 8261240fb0..6e4b994bfb 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -57,4 +57,4 @@
"understand/glossary.md",
'how-to/debugging_env.rst',
"data/env_variables_hip.rst"
-]
\ No newline at end of file
+]
diff --git a/docs/data/env_variables_hip.rst b/docs/data/env_variables_hip.rst
index 6186671ecf..4192db7387 100644
--- a/docs/data/env_variables_hip.rst
+++ b/docs/data/env_variables_hip.rst
@@ -2,6 +2,9 @@
:description: HIP environment variables
:keywords: AMD, HIP, environment variables, environment
+HIP GPU isolation variables
+--------------------------------------------------------------------------------
+
The GPU isolation environment variables in HIP are collected in the following table.
.. _hip-env-isolation:
@@ -24,6 +27,9 @@ The GPU isolation environment variables in HIP are collected in the following ta
| Device indices exposed to HIP applications.
- Example: ``0,2``
+HIP profiling variables
+--------------------------------------------------------------------------------
+
The profiling environment variables in HIP are collected in the following table.
.. _hip-env-prof:
@@ -50,6 +56,9 @@ The profiling environment variables in HIP are collected in the following table.
- | 0: Disable
| 1: Enable
+HIP debug variables
+--------------------------------------------------------------------------------
+
The debugging environment variables in HIP are collected in the following table.
.. _hip-env-debug:
@@ -149,6 +158,9 @@ The debugging environment variables in HIP are collected in the following table.
number does not apply to hardware queues that are created for CU-masked HIP streams, or
cooperative queues for HIP Cooperative Groups (single queue per device).
+HIP memory management related variables
+--------------------------------------------------------------------------------
+
The memory management related environment variables in HIP are collected in the
following table.
@@ -245,6 +257,9 @@ following table.
- | 0: Disable
| 1: Enable
+HIP miscellaneous variables
+--------------------------------------------------------------------------------
+
The following table lists environment variables that are useful but relate to
different features in HIP.
diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.drawio b/docs/data/understand/programming_model/cpu-gpu-comparison.drawio
new file mode 100644
index 0000000000..a7e851b3d5
--- /dev/null
+++ b/docs/data/understand/programming_model/cpu-gpu-comparison.drawio
@@ -0,0 +1,181 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/cpu-gpu-comparison.svg b/docs/data/understand/programming_model/cpu-gpu-comparison.svg
new file mode 100644
index 0000000000..552290299f
--- /dev/null
+++ b/docs/data/understand/programming_model/cpu-gpu-comparison.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/host-device-flow.drawio b/docs/data/understand/programming_model/host-device-flow.drawio
new file mode 100644
index 0000000000..2ee8c43ae9
--- /dev/null
+++ b/docs/data/understand/programming_model/host-device-flow.drawio
@@ -0,0 +1,61 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/host-device-flow.svg b/docs/data/understand/programming_model/host-device-flow.svg
new file mode 100644
index 0000000000..02bce96c5d
--- /dev/null
+++ b/docs/data/understand/programming_model/host-device-flow.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/memory-access.drawio b/docs/data/understand/programming_model/memory-access.drawio
new file mode 100644
index 0000000000..3577772532
--- /dev/null
+++ b/docs/data/understand/programming_model/memory-access.drawio
@@ -0,0 +1,237 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/memory-access.svg b/docs/data/understand/programming_model/memory-access.svg
new file mode 100644
index 0000000000..5f0dbd8aae
--- /dev/null
+++ b/docs/data/understand/programming_model/memory-access.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/multi-gpu.drawio b/docs/data/understand/programming_model/multi-gpu.drawio
new file mode 100644
index 0000000000..17eca3c318
--- /dev/null
+++ b/docs/data/understand/programming_model/multi-gpu.drawio
@@ -0,0 +1,64 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/multi-gpu.svg b/docs/data/understand/programming_model/multi-gpu.svg
new file mode 100644
index 0000000000..190f2593d2
--- /dev/null
+++ b/docs/data/understand/programming_model/multi-gpu.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/simt-execution.drawio b/docs/data/understand/programming_model/simt-execution.drawio
new file mode 100644
index 0000000000..1e2652f51f
--- /dev/null
+++ b/docs/data/understand/programming_model/simt-execution.drawio
@@ -0,0 +1,124 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/simt-execution.svg b/docs/data/understand/programming_model/simt-execution.svg
new file mode 100644
index 0000000000..412b9265e7
--- /dev/null
+++ b/docs/data/understand/programming_model/simt-execution.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/simt.drawio b/docs/data/understand/programming_model/simt.drawio
deleted file mode 100644
index 4c5c5a3f26..0000000000
--- a/docs/data/understand/programming_model/simt.drawio
+++ /dev/null
@@ -1,148 +0,0 @@
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/simt.svg b/docs/data/understand/programming_model/simt.svg
deleted file mode 100644
index c149ab88e4..0000000000
--- a/docs/data/understand/programming_model/simt.svg
+++ /dev/null
@@ -1 +0,0 @@
-
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/stream-workflow.drawio b/docs/data/understand/programming_model/stream-workflow.drawio
new file mode 100644
index 0000000000..616dd28d78
--- /dev/null
+++ b/docs/data/understand/programming_model/stream-workflow.drawio
@@ -0,0 +1,97 @@
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/data/understand/programming_model/stream-workflow.svg b/docs/data/understand/programming_model/stream-workflow.svg
new file mode 100644
index 0000000000..9648351cad
--- /dev/null
+++ b/docs/data/understand/programming_model/stream-workflow.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile
index 6570128d00..fb4eaae2de 100644
--- a/docs/doxygen/Doxyfile
+++ b/docs/doxygen/Doxyfile
@@ -170,7 +170,8 @@ FULL_PATH_NAMES = YES
# will be relative from the directory where doxygen is started.
# This tag requires that the tag FULL_PATH_NAMES is set to YES.
-STRIP_FROM_PATH =
+STRIP_FROM_PATH = ../../ \
+ ../../../
# The STRIP_FROM_INC_PATH tag can be used to strip a user-defined part of the
# path mentioned in the documentation of a class, which tells the reader which
diff --git a/docs/how-to/hip_cpp_language_extensions.rst b/docs/how-to/hip_cpp_language_extensions.rst
index aa993541e4..2cde72cfdc 100644
--- a/docs/how-to/hip_cpp_language_extensions.rst
+++ b/docs/how-to/hip_cpp_language_extensions.rst
@@ -250,43 +250,6 @@ Units, also known as SIMDs, each with their own register file. For more
information see :doc:`../understand/hardware_implementation`.
:cpp:struct:`hipDeviceProp_t` also has a field ``executionUnitsPerMultiprocessor``.
-Porting from CUDA __launch_bounds__
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-CUDA also defines a ``__launch_bounds__`` qualifier which works similar to HIP's
-implementation, however it uses different parameters:
-
-.. code-block:: cpp
-
- __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
-
-The first parameter is the same as HIP's implementation, but
-``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
-``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than
-blocks and multiprocessors. This conversion is performed automatically by
-:doc:`HIPIFY `, or can be done manually with the following
-equation.
-
-.. code-block:: cpp
-
- MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
-
-Directly controlling the warps per execution unit makes it easier to reason
-about the occupancy, unlike with blocks, where the occupancy depends on the
-block size.
-
-The use of execution units rather than multiprocessors also provides support for
-architectures with multiple execution units per multiprocessor. For example, the
-AMD GCN architecture has 4 execution units per multiprocessor.
-
-maxregcount
-""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
-
-Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
-Instead, users are encouraged to use the ``__launch_bounds__`` directive since
-the parameters are more intuitive and portable than micro-architecture details
-like registers. The directive allows per-kernel control.
-
Memory space qualifiers
================================================================================
@@ -448,9 +411,7 @@ warpSize
================================================================================
The ``warpSize`` constant contains the number of threads per warp for the given
-target device. It can differ between different architectures, and on RDNA
-architectures it can even differ between kernel launches, depending on whether
-they run in CU or WGP mode. See the
+target device. It can differ between different architectures, see the
:doc:`hardware features <../reference/hardware_features>` for more
information.
@@ -469,7 +430,7 @@ compile-time constant on the host. It has to be queried using
applications. NVIDIA devices return 32 for this variable; AMD devices return
64 for gfx9 and 32 for gfx10 and above. While code that assumes a ``warpSize``
of 32 can run on devices with a ``warpSize`` of 64, it only utilizes half of
- the the compute resources.
+ the compute resources.
********************************************************************************
Vector types
diff --git a/docs/how-to/hip_porting_driver_api.rst b/docs/how-to/hip_porting_driver_api.rst
index d4d9da1673..7d7ebbc24d 100644
--- a/docs/how-to/hip_porting_driver_api.rst
+++ b/docs/how-to/hip_porting_driver_api.rst
@@ -1,6 +1,6 @@
.. meta::
:description: This chapter presents how to port the CUDA driver API and showcases equivalent operations in HIP.
- :keywords: AMD, ROCm, HIP, CUDA, driver API
+ :keywords: AMD, ROCm, HIP, CUDA, driver API, porting, port
.. _porting_driver_api:
@@ -8,26 +8,25 @@
Porting CUDA driver API
*******************************************************************************
-NVIDIA provides separate CUDA driver and runtime APIs. The two APIs have
-significant overlap in functionality:
-
-* Both APIs support events, streams, memory management, memory copy, and error
- handling.
-
-* Both APIs deliver similar performance.
+CUDA provides separate driver and runtime APIs. The two APIs generally provide
+the similar functionality and mostly can be used interchangeably, however the
+driver API allows for more fine-grained control over the kernel level
+initialization, contexts and module management. This is all taken care of
+implicitly by the runtime API.
* Driver API calls begin with the prefix ``cu``, while runtime API calls begin
with the prefix ``cuda``. For example, the driver API contains
``cuEventCreate``, while the runtime API contains ``cudaEventCreate``, which
has similar functionality.
-* The driver API defines a different, but largely overlapping, error code space
- than the runtime API and uses a different coding convention. For example, the
- driver API defines ``CUDA_ERROR_INVALID_VALUE``, while the runtime API defines
- ``cudaErrorInvalidValue``.
+* The driver API offers two additional low-level functionalities not exposed by
+ the runtime API: module management ``cuModule*`` and context management
+ ``cuCtx*`` APIs.
-The driver API offers two additional functionalities not provided by the runtime
-API: ``cuModule`` and ``cuCtx`` APIs.
+HIP does not explicitly provide two different APIs, the corresponding functions
+for the CUDA driver API are available in the HIP runtime API, and are usually
+prefixed with ``hipDrv``. The module and context functionality is available with
+the ``hipModule`` and ``hipCtx`` prefix.
cuModule API
================================================================================
@@ -120,12 +119,21 @@ For context reference, visit :ref:`context_management_reference`.
HIPIFY translation of CUDA driver API
================================================================================
-The HIPIFY tools convert CUDA driver APIs for streams, events, modules, devices, memory management, context, and the profiler to the equivalent HIP calls. For example, ``cuEventCreate`` is translated to ``hipEventCreate``.
-HIPIFY tools also convert error codes from the driver namespace and coding conventions to the equivalent HIP error code. HIP unifies the APIs for these common functions.
-
-The memory copy API requires additional explanation. The CUDA driver includes the memory direction in the name of the API (``cuMemcpyH2D``), while the CUDA driver API provides a single memory copy API with a parameter that specifies the direction. It also supports a "default" direction where the runtime determines the direction automatically.
-HIP provides APIs with both styles, for example, ``hipMemcpyH2D`` as well as ``hipMemcpy``.
-The first version might be faster in some cases because it avoids any host overhead to detect the different memory directions.
+The HIPIFY tools convert CUDA driver APIs such as streams, events, modules,
+devices, memory management, context, and the profiler to the equivalent HIP
+calls. For example, ``cuEventCreate`` is translated to :cpp:func:`hipEventCreate`.
+HIPIFY tools also convert error codes from the driver namespace and coding
+conventions to the equivalent HIP error code. HIP unifies the APIs for these
+common functions.
+
+The memory copy API requires additional explanation. The CUDA driver includes
+the memory direction in the name of the API (``cuMemcpyHtoD``), while the CUDA
+runtime API provides a single memory copy API with a parameter that specifies
+the direction. It also supports a "default" direction where the runtime
+determines the direction automatically.
+HIP provides both versions, for example, :cpp:func:`hipMemcpyHtoD` as well as
+:cpp:func:`hipMemcpy`. The first version might be faster in some cases because
+it avoids any host overhead to detect the different memory directions.
HIP defines a single error space and uses camel case for all errors (i.e. ``hipErrorInvalidValue``).
@@ -134,16 +142,25 @@ For further information, visit the :doc:`hipify:index`.
Address spaces
--------------------------------------------------------------------------------
-HIP-Clang defines a process-wide address space where the CPU and all devices allocate addresses from a single unified pool.
-This means addresses can be shared between contexts. Unlike the original CUDA implementation, a new context does not create a new address space for the device.
+HIP-Clang defines a process-wide address space where the CPU and all devices
+allocate addresses from a single unified pool.
+This means addresses can be shared between contexts. Unlike the original CUDA
+implementation, a new context does not create a new address space for the device.
Using hipModuleLaunchKernel
--------------------------------------------------------------------------------
-Both CUDA driver and runtime APIs define a function for launching kernels, called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is ``hipModuleLaunchKernel``.
-The kernel arguments and the execution configuration (grid dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the launch function.
-The runtime API additionally provides the ``<<< >>>`` syntax for launching kernels, which resembles a special function call and is easier to use than the explicit launch API, especially when handling kernel arguments.
-However, this syntax is not standard C++ and is available only when NVCC is used to compile the host code.
+Both CUDA driver and runtime APIs define a function for launching kernels,
+called ``cuLaunchKernel`` or ``cudaLaunchKernel``. The equivalent API in HIP is
+``hipModuleLaunchKernel``.
+The kernel arguments and the execution configuration (grid dimensions, group
+dimensions, dynamic shared memory, and stream) are passed as arguments to the
+launch function.
+The runtime API additionally provides the ``<<< >>>`` syntax for launching
+kernels, which resembles a special function call and is easier to use than the
+explicit launch API, especially when handling kernel arguments.
+However, this syntax is not standard C++ and is available only when NVCC is used
+to compile the host code.
Additional information
--------------------------------------------------------------------------------
@@ -186,12 +203,24 @@ functions.
Kernel launching
--------------------------------------------------------------------------------
-HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax, ``hipLaunchKernel``, or ``hipLaunchKernelGGL``. The last option is a macro which expands to the CUDA ``<<<>>>`` syntax by default. It can also be turned into a template by defining ``HIP_TEMPLATE_KERNEL_LAUNCH``.
+HIP-Clang supports kernel launching using either the CUDA ``<<<>>>`` syntax,
+``hipLaunchKernel``, or ``hipLaunchKernelGGL``. The last option is a macro which
+expands to the CUDA ``<<<>>>`` syntax by default. It can also be turned into a
+template by defining ``HIP_TEMPLATE_KERNEL_LAUNCH``.
-When the executable or shared library is loaded by the dynamic linker, the initialization functions are called. In the initialization functions, the code objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is called. When ``__hipRegisterFunction`` is called, the stub functions are associated with the corresponding kernels in the code objects.
+When the executable or shared library is loaded by the dynamic linker, the
+initialization functions are called. In the initialization functions, the code
+objects containing all kernels are loaded when ``__hipRegisterFatBinary`` is
+called. When ``__hipRegisterFunction`` is called, the stub functions are
+associated with the corresponding kernels in the code objects.
HIP-Clang implements two sets of APIs for launching kernels.
-By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code, it first calls ``hipConfigureCall`` to set up the threads and grids. It then calls the stub function with the given arguments. The stub function calls ``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr`` with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual kernel associated with the stub function is launched.
+By default, when HIP-Clang encounters the ``<<<>>>`` statement in the host code,
+it first calls ``hipConfigureCall`` to set up the threads and grids. It then
+calls the stub function with the given arguments. The stub function calls
+``hipSetupArgument`` for each kernel argument, then calls ``hipLaunchByPtr``
+with a function pointer to the stub function. In ``hipLaunchByPtr``, the actual
+kernel associated with the stub function is launched.
NVCC implementation notes
================================================================================
@@ -199,7 +228,9 @@ NVCC implementation notes
Interoperation between HIP and CUDA driver
--------------------------------------------------------------------------------
-CUDA applications might want to mix CUDA driver code with HIP code (see the example below). This table shows the equivalence between CUDA and HIP types required to implement this interaction.
+CUDA applications might want to mix CUDA driver code with HIP code (see the
+example below). This table shows the equivalence between CUDA and HIP types
+required to implement this interaction.
.. list-table:: Equivalence table between HIP and CUDA types
:header-rows: 1
@@ -547,3 +578,72 @@ The HIP version number is defined as an integer:
.. code-block:: cpp
HIP_VERSION=HIP_VERSION_MAJOR * 10000000 + HIP_VERSION_MINOR * 100000 + HIP_VERSION_PATCH
+
+CU_POINTER_ATTRIBUTE_MEMORY_TYPE
+================================================================================
+
+To get the pointer's memory type in HIP, developers should use
+:cpp:func:`hipPointerGetAttributes`. First parameter of the function is
+`hipPointerAttribute_t`. Its ``type`` member variable indicates whether the
+memory pointed to is allocated on the device or the host.
+
+For example:
+
+.. code-block:: cpp
+
+ double * ptr;
+ hipMalloc(&ptr, sizeof(double));
+ hipPointerAttribute_t attr;
+ hipPointerGetAttributes(&attr, ptr); /*attr.type is hipMemoryTypeDevice*/
+ if(attr.type == hipMemoryTypeDevice)
+ std::cout << "ptr is of type hipMemoryTypeDevice" << std::endl;
+
+ double* ptrHost;
+ hipHostMalloc(&ptrHost, sizeof(double));
+ hipPointerAttribute_t attr;
+ hipPointerGetAttributes(&attr, ptrHost); /*attr.type is hipMemoryTypeHost*/
+ if(attr.type == hipMemorTypeHost)
+ std::cout << "ptrHost is of type hipMemoryTypeHost" << std::endl;
+
+Note that ``hipMemoryType`` enum values are different from the
+``cudaMemoryType`` enum values.
+
+For example, on AMD platform, `hipMemoryType` is defined in `hip_runtime_api.h`,
+
+.. code-block:: cpp
+
+ typedef enum hipMemoryType {
+ hipMemoryTypeHost = 0, ///< Memory is physically located on host
+ hipMemoryTypeDevice = 1, ///< Memory is physically located on device. (see deviceId for specific device)
+ hipMemoryTypeArray = 2, ///< Array memory, physically located on device. (see deviceId for specific device)
+ hipMemoryTypeUnified = 3, ///< Not used currently
+ hipMemoryTypeManaged = 4 ///< Managed memory, automaticallly managed by the unified memory system
+ } hipMemoryType;
+
+Looking into CUDA toolkit, it defines `cudaMemoryType` as following,
+
+.. code-block:: cpp
+
+ enum cudaMemoryType
+ {
+ cudaMemoryTypeUnregistered = 0, // Unregistered memory.
+ cudaMemoryTypeHost = 1, // Host memory.
+ cudaMemoryTypeDevice = 2, // Device memory.
+ cudaMemoryTypeManaged = 3, // Managed memory
+ }
+
+In this case, memory type translation for ``hipPointerGetAttributes`` needs to
+be handled properly on NVIDIA platform to get the correct memory type in CUDA,
+which is done in the file ``nvidia_hip_runtime_api.h``.
+
+So in any HIP applications which use HIP APIs involving memory types, developers
+should use ``#ifdef`` in order to assign the correct enum values depending on
+NVIDIA or AMD platform.
+
+As an example, please see the code from the `link `_.
+
+With the ``#ifdef`` condition, HIP APIs work as expected on both AMD and NVIDIA
+platforms.
+
+Note, ``cudaMemoryTypeUnregistered`` is currently not supported as
+``hipMemoryType`` enum, due to HIP functionality backward compatibility.
diff --git a/docs/how-to/hip_porting_guide.md b/docs/how-to/hip_porting_guide.md
deleted file mode 100644
index a6027d4801..0000000000
--- a/docs/how-to/hip_porting_guide.md
+++ /dev/null
@@ -1,582 +0,0 @@
-
-
-
-
-
-
-# HIP porting guide
-
-In addition to providing a portable C++ programming environment for GPUs, HIP is designed to ease
-the porting of existing CUDA code into the HIP environment. This section describes the available tools
-and provides practical suggestions on how to port CUDA code and work through common issues.
-
-## Porting a New CUDA Project
-
-### General Tips
-
-* Starting the port on a CUDA machine is often the easiest approach, since you can incrementally port pieces of the code to HIP while leaving the rest in CUDA. (Recall that on CUDA machines HIP is just a thin layer over CUDA, so the two code types can interoperate on NVCC platforms.) Also, the HIP port can be compared with the original CUDA code for function and performance.
-* Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the HIP code using the HIP compiler on an AMD machine.
-* HIP ports can replace CUDA versions: HIP can deliver the same performance as a native CUDA implementation, with the benefit of portability to both NVIDIA and AMD architectures as well as a path to future C++ standard support. You can handle platform-specific features through conditional compilation or by adding them to the open-source HIP infrastructure.
-* Use **[hipconvertinplace-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipconvertinplace-perl.sh)** to hipify all code files in the CUDA source directory.
-
-### Scanning existing CUDA code to scope the porting effort
-
-The **[hipexamine-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipexamine-perl.sh)** tool will scan a source directory to determine which files contain CUDA code and how much of that code can be automatically hipified.
-
-```shell
-> cd examples/rodinia_3.0/cuda/kmeans
-> $HIP_DIR/bin/hipexamine-perl.sh.
-info: hipify ./kmeans.h =====>
-info: hipify ./unistd.h =====>
-info: hipify ./kmeans.c =====>
-info: hipify ./kmeans_cuda_kernel.cu =====>
- info: converted 40 CUDA->HIP refs( dev:0 mem:0 kern:0 builtin:37 math:0 stream:0 event:0 err:0 def:0 tex:3 other:0 ) warn:0 LOC:185
-info: hipify ./getopt.h =====>
-info: hipify ./kmeans_cuda.cu =====>
- info: converted 49 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:0 math:0 stream:0 event:0 err:0 def:0 tex:12 other:0 ) warn:0 LOC:311
-info: hipify ./rmse.c =====>
-info: hipify ./cluster.c =====>
-info: hipify ./getopt.c =====>
-info: hipify ./kmeans_clustering.c =====>
-info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 stream:0 event:0 err:0 def:0 tex:15 other:0 ) warn:0 LOC:3607
- kernels (1 total) : kmeansPoint(1)
-```
-
-hipexamine-perl scans each code file (cpp, c, h, hpp, etc.) found in the specified directory:
-
-* Files with no CUDA code (`kmeans.h`) print one line summary just listing the source file name.
-* Files with CUDA code print a summary of what was found - for example the `kmeans_cuda_kernel.cu` file:
-
-```shell
-info: hipify ./kmeans_cuda_kernel.cu =====>
- info: converted 40 CUDA->HIP refs( dev:0 mem:0 kern:0 builtin:37 math:0 stream:0 event:0
-```
-
-* Interesting information in `kmeans_cuda_kernel.cu` :
- * How many CUDA calls were converted to HIP (40)
- * Breakdown of the CUDA functionality used (`dev:0 mem:0` etc). This file uses many CUDA builtins (37) and texture functions (3).
- * Warning for code that looks like CUDA API but was not converted (0 in this file).
- * Count Lines-of-Code (LOC) - 185 for this file.
-
-* hipexamine-perl also presents a summary at the end of the process for the statistics collected across all files. This has similar format to the per-file reporting, and also includes a list of all kernels which have been called. An example from above:
-
-```shell
-info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 stream:0 event:0 err:0 def:0 tex:15 other:0 ) warn:0 LOC:3607
- kernels (1 total) : kmeansPoint(1)
-```
-
-### Converting a project "in-place"
-
-```shell
-> hipify-perl --inplace
-```
-
-For each input file FILE, this script will:
-
-* If `FILE.prehip` file does not exist, copy the original code to a new file with extension `.prehip`. Then hipify the code file.
-* If `FILE.prehip` file exists, hipify `FILE.prehip` and save to FILE.
-
-This is useful for testing improvements to the hipify toolset.
-
-The [hipconvertinplace-perl.sh](https://github.com/ROCm/HIPIFY/blob/amd-staging/bin/hipconvertinplace-perl.sh) script will perform inplace conversion for all code files in the specified directory.
-This can be quite handy when dealing with an existing CUDA code base since the script preserves the existing directory structure
-and filenames - and includes work. After converting in-place, you can review the code to add additional parameters to
-directory names.
-
-```shell
-> hipconvertinplace-perl.sh MY_SRC_DIR
-```
-
-### Library Equivalents
-
-Most CUDA libraries have a corresponding ROCm library with similar functionality and APIs. However, ROCm also provides HIP marshalling libraries that greatly simplify the porting process because they more precisely reflect their CUDA counterparts and can be used with either the AMD or NVIDIA platforms (see "Identifying HIP Target Platform" below). There are a few notable exceptions:
-
-* MIOpen does not have a marshalling library interface to ease porting from cuDNN.
-* RCCL is a drop-in replacement for NCCL and implements the NCCL APIs.
-* hipBLASLt does not have a ROCm library but can still target the NVIDIA platform, as needed.
-* EIGEN's HIP support is part of the library.
-
-| CUDA Library | HIP Library | ROCm Library | Comment |
-|------------- | ----------- | ------------ | ------- |
-| cuBLAS | hipBLAS | rocBLAS | Basic Linear Algebra Subroutines
-| cuBLASLt | hipBLASLt | N/A | Basic Linear Algebra Subroutines, lightweight and new flexible API
-| cuFFT | hipFFT | rocFFT | Fast Fourier Transfer Library
-| cuSPARSE | hipSPARSE | rocSPARSE | Sparse BLAS + SPMV
-| cuSOLVER | hipSOLVER | rocSOLVER | Lapack library
-| AmgX | N/A | rocALUTION | Sparse iterative solvers and preconditioners with algebraic multigrid
-| Thrust | N/A | rocThrust | C++ parallel algorithms library
-| CUB | hipCUB | rocPRIM | Low Level Optimized Parallel Primitives
-| cuDNN | N/A | MIOpen | Deep learning Solver Library
-| cuRAND | hipRAND | rocRAND | Random Number Generator Library
-| EIGEN | EIGEN | N/A | C++ template library for linear algebra: matrices, vectors, numerical solvers,
-| NCCL | N/A | RCCL | Communications Primitives Library based on the MPI equivalents
-
-## Distinguishing Compiler Modes
-
-### Identifying HIP Target Platform
-
-All HIP projects target either AMD or NVIDIA platform. The platform affects which headers are included and which libraries are used for linking.
-
-* `__HIP_PLATFORM_AMD__` is defined if the HIP platform targets AMD.
-Note, `__HIP_PLATFORM_HCC__` was previously defined if the HIP platform targeted AMD, it is deprecated.
-* `__HIP_PLATFORM_NVDIA__` is defined if the HIP platform targets NVIDIA.
-Note, `__HIP_PLATFORM_NVCC__` was previously defined if the HIP platform targeted NVIDIA, it is deprecated.
-
-### Identifying the Compiler: hip-clang or NVCC
-
-Often, it's useful to know whether the underlying compiler is HIP-Clang or NVCC. This knowledge can guard platform-specific code or aid in platform-specific performance tuning.
-
-```cpp
-#ifdef __HIP_PLATFORM_AMD__
-// Compiled with HIP-Clang
-#endif
-```
-
-```cpp
-#ifdef __HIP_PLATFORM_NVIDIA__
-// Compiled with nvcc
-// Could be compiling with CUDA language extensions enabled (for example, a ".cu file)
-// Could be in pass-through mode to an underlying host compile OR (for example, a .cpp file)
-
-```
-
-```cpp
-#ifdef __CUDACC__
-// Compiled with nvcc (CUDA language extensions enabled)
-```
-
-Compiler directly generates the host code (using the Clang x86 target) and passes the code to another host compiler. Thus, they have no equivalent of the `__CUDACC__` define.
-
-### Identifying Current Compilation Pass: Host or Device
-
-NVCC makes two passes over the code: one for host code and one for device code.
-HIP-Clang will have multiple passes over the code: one for the host code, and one for each architecture on the device code.
-`__HIP_DEVICE_COMPILE__` is set to a nonzero value when the compiler (HIP-Clang or NVCC) is compiling code for a device inside a `__global__` kernel or for a device function. `__HIP_DEVICE_COMPILE__` can replace `#ifdef` checks on the `__CUDA_ARCH__` define.
-
-```cpp
-// #ifdef __CUDA_ARCH__
-#if __HIP_DEVICE_COMPILE__
-```
-
-Unlike `__CUDA_ARCH__`, the `__HIP_DEVICE_COMPILE__` value is 1 or undefined, and it doesn't represent the feature capability of the target device.
-
-### Compiler Defines: Summary
-
-|Define | HIP-Clang | NVCC | Other (GCC, ICC, Clang, etc.)
-|--- | --- | --- |--- |
-|HIP-related defines:|
-|`__HIP_PLATFORM_AMD__` | Defined | Undefined | Defined if targeting AMD platform; undefined otherwise |
-|`__HIP_PLATFORM_NVIDIA__` | Undefined | Defined | Defined if targeting NVIDIA platform; undefined otherwise |
-|`__HIP_DEVICE_COMPILE__` | 1 if compiling for device; undefined if compiling for host | 1 if compiling for device; undefined if compiling for host | Undefined
-|`__HIPCC__` | Defined | Defined | Undefined
-|`__HIP_ARCH_*` | 0 or 1 depending on feature support (see below) | 0 or 1 depending on feature support (see below) | 0
-|NVCC-related defines:|
-|`__CUDACC__` | Defined if source code is compiled by NVCC; undefined otherwise | Undefined
-|`__NVCC__` Undefined | Defined | Undefined
-|`__CUDA_ARCH__` | Undefined | Unsigned representing compute capability (e.g., "130") if in device code; 0 if in host code | Undefined
-|hip-clang-related defines:|
-|`__HIP__` | Defined | Undefined | Undefined
-|HIP-Clang common defines: |
-|`__clang__` | Defined | Defined | Undefined | Defined if using Clang; otherwise undefined
-
-## Identifying Architecture Features
-
-### HIP_ARCH Defines
-
-Some CUDA code tests `__CUDA_ARCH__` for a specific value to determine whether the machine supports a certain architectural feature. For instance,
-
-```cpp
-#if (__CUDA_ARCH__ >= 130)
-// doubles are supported
-```
-
-This type of code requires special attention, since AMD and CUDA devices have different architectural capabilities. Moreover, you can't determine the presence of a feature using a simple comparison against an architecture's version number. HIP provides a set of defines and device properties to query whether a specific architectural feature is supported.
-
-The `__HIP_ARCH_*` defines can replace comparisons of `__CUDA_ARCH__` values:
-
-```cpp
-//#if (__CUDA_ARCH__ >= 130) // non-portable
-if __HIP_ARCH_HAS_DOUBLES__ { // portable HIP feature query
- // doubles are supported
-}
-```
-
-For host code, the `__HIP_ARCH__*` defines are set to 0. You should only use the `__HIP_ARCH__` fields in device code.
-
-### Device-Architecture Properties
-
-Host code should query the architecture feature flags in the device properties that `hipGetDeviceProperties` returns, rather than testing the "major" and "minor" fields directly:
-
-```cpp
-hipGetDeviceProperties(&deviceProp, device);
-//if ((deviceProp.major == 1 && deviceProp.minor < 2)) // non-portable
-if (deviceProp.arch.hasSharedInt32Atomics) { // portable HIP feature query
- // has shared int32 atomic operations ...
-}
-```
-
-### Table of Architecture Properties
-
-The table below shows the full set of architectural properties that HIP supports.
-
-|Define (use only in device code) | Device Property (run-time query) | Comment |
-|------- | --------- | ----- |
-|32-bit atomics: | |
-|`__HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__` | `hasGlobalInt32Atomics` |32-bit integer atomics for global memory
-|`__HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__` | `hasGlobalFloatAtomicExch` |32-bit float atomic exchange for global memory
-|`__HIP_ARCH_HAS_SHARED_INT32_ATOMICS__` | `hasSharedInt32Atomics` |32-bit integer atomics for shared memory
-|`__HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__` | `hasSharedFloatAtomicExch` |32-bit float atomic exchange for shared memory
-|`__HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__` | `hasFloatAtomicAdd` |32-bit float atomic add in global and shared memory
-|64-bit atomics: | |
-|`__HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__` | `hasGlobalInt64Atomics` |64-bit integer atomics for global memory
-|`__HIP_ARCH_HAS_SHARED_INT64_ATOMICS__` | `hasSharedInt64Atomics` |64-bit integer atomics for shared memory
-|Doubles: | |
-|`__HIP_ARCH_HAS_DOUBLES__` | `hasDoubles` |Double-precision floating point
-|Warp cross-lane operations: | |
-|`__HIP_ARCH_HAS_WARP_VOTE__` | `hasWarpVote` |Warp vote instructions (`any`, `all`)
-|`__HIP_ARCH_HAS_WARP_BALLOT__` | `hasWarpBallot` |Warp ballot instructions
-|`__HIP_ARCH_HAS_WARP_SHUFFLE__` | `hasWarpShuffle` |Warp shuffle operations (`shfl_*`)
-|`__HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__` | `hasFunnelShift` |Funnel shift two input words into one
-|Sync: | |
-|`__HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__` | `hasThreadFenceSystem` |`threadfence_system`
-|`__HIP_ARCH_HAS_SYNC_THREAD_EXT__` | `hasSyncThreadsExt` |`syncthreads_count`, `syncthreads_and`, `syncthreads_or`
-|Miscellaneous: | |
-|`__HIP_ARCH_HAS_SURFACE_FUNCS__` | `hasSurfaceFuncs` |
-|`__HIP_ARCH_HAS_3DGRID__` | `has3dGrid` | Grids and groups are 3D
-|`__HIP_ARCH_HAS_DYNAMIC_PARALLEL__` | `hasDynamicParallelism` |
-
-## Finding HIP
-
-Makefiles can use the following syntax to conditionally provide a default HIP_PATH if one does not exist:
-
-```shell
-HIP_PATH ?= $(shell hipconfig --path)
-```
-
-## Identifying HIP Runtime
-
-HIP can depend on rocclr, or CUDA as runtime
-
-* AMD platform
-On AMD platform, HIP uses ROCm Compute Language Runtime, called ROCclr.
-ROCclr is a virtual device interface that HIP runtimes interact with different backends which allows runtimes to work on Linux , as well as Windows without much efforts.
-
-* NVIDIA platform
-On NVIDIA platform, HIP is just a thin layer on top of CUDA.
-
-The environment variable `HIP_PLATFORM` specifies the runtime to use. The
-platform is detected automatically by HIP. When an AMD graphics driver and an
-AMD GPU is detected, `HIP_PLATFORM` is set to `amd`. If both runtimes are
-installed, and a specific one should be used, or HIP can't detect the runtime,
-setting the environment variable manually tells `hipcc` what compilation path to
-choose. To use the CUDA compilation path, set the environment variable to
-`HIP_PLATFORM=nvidia`.
-
-## `hipLaunchKernelGGL`
-
-`hipLaunchKernelGGL` is a macro that can serve as an alternative way to launch kernel, which accepts parameters of launch configurations (grid dims, group dims, stream, dynamic shared size) followed by a variable number of kernel arguments.
-It can replace <<< >>>, if the user so desires.
-
-## Compiler Options
-
-hipcc is a portable compiler driver that will call NVCC or HIP-Clang (depending on the target system) and attach all required include and library options. It passes options through to the target compiler. Tools that call hipcc must ensure the compiler options are appropriate for the target compiler.
-The `hipconfig` script may helpful in identifying the target platform, compiler and runtime. It can also help set options appropriately.
-
-### Compiler options supported on AMD platforms
-
-Here are the main compiler options supported on AMD platforms by HIP-Clang.
-
-| Option | Description |
-| ------ | ----------- |
-| `--amdgpu-target=` | [DEPRECATED] This option is being replaced by `--offload-arch=`. Generate code for the given GPU target. Supported targets are gfx701, gfx801, gfx802, gfx803, gfx900, gfx906, gfx908, gfx1010, gfx1011, gfx1012, gfx1030, gfx1031. This option could appear multiple times on the same command line to generate a fat binary for multiple targets. |
-| `--fgpu-rdc` | Generate relocatable device code, which allows kernels or device functions calling device functions in different translation units. |
-| `-ggdb` | Equivalent to `-g` plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code. |
-| `--gpu-max-threads-per-block=` | Generate code to support up to the specified number of threads per block. |
-| `-O` | Specify the optimization level. |
-| `-offload-arch=` | Specify the AMD GPU [target ID](https://clang.llvm.org/docs/ClangOffloadBundler.html#target-id). |
-| `-save-temps` | Save the compiler generated intermediate files. |
-| `-v` | Show the compilation steps. |
-
-## Linking Issues
-
-### Linking With hipcc
-
-hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (NVCC or AMD compiler). We recommend linking with hipcc since it automatically links the binary to the necessary HIP runtime libraries. It also has knowledge on how to link and to manage the GPU objects.
-
-### `-lm` Option
-
-hipcc adds `-lm` by default to the link command.
-
-## Linking Code With Other Compilers
-
-CUDA code often uses NVCC for accelerator code (defining and launching kernels, typically defined in `.cu` or `.cuh` files).
-It also uses a standard compiler (g++) for the rest of the application. NVCC is a preprocessor that employs a standard host compiler (gcc) to generate the host code.
-Code compiled using this tool can employ only the intersection of language features supported by both NVCC and the host compiler.
-In some cases, you must take care to ensure the data types and alignment of the host compiler are identical to those of the device compiler. Only some host compilers are supported---for example, recent NVCC versions lack Clang host-compiler capability.
-
-HIP-Clang generates both device and host code using the same Clang-based compiler. The code uses the same API as gcc, which allows code generated by different gcc-compatible compilers to be linked together. For example, code compiled using HIP-Clang can link with code compiled using "standard" compilers (such as gcc, ICC and Clang). Take care to ensure all compilers use the same standard C++ header and library formats.
-
-### libc++ and libstdc++
-
-hipcc links to libstdc++ by default. This provides better compatibility between g++ and HIP.
-
-If you pass `--stdlib=libc++` to hipcc, hipcc will use the libc++ library. Generally, libc++ provides a broader set of C++ features while libstdc++ is the standard for more compilers (notably including g++).
-
-When cross-linking C++ code, any C++ functions that use types from the C++ standard library (including std::string, std::vector and other containers) must use the same standard-library implementation. They include the following:
-
-* Functions or kernels defined in HIP-Clang that are called from a standard compiler
-* Functions defined in a standard compiler that are called from HIP-Clang.
-
-Applications with these interfaces should use the default libstdc++ linking.
-
-Applications which are compiled entirely with hipcc, and which benefit from advanced C++ features not supported in libstdc++, and which do not require portability to NVCC, may choose to use libc++.
-
-### HIP Headers (`hip_runtime.h`, `hip_runtime_api.h`)
-
-The `hip_runtime.h` and `hip_runtime_api.h` files define the types, functions and enumerations needed to compile a HIP program:
-
-* `hip_runtime_api.h`: defines all the HIP runtime APIs (e.g., `hipMalloc`) and the types required to call them. A source file that is only calling HIP APIs but neither defines nor launches any kernels can include `hip_runtime_api.h`. `hip_runtime_api.h` uses no custom Heterogeneous Compute (HC) language features and can be compiled using a standard C++ compiler.
-* `hip_runtime.h`: included in `hip_runtime_api.h`. It additionally provides the types and defines required to create and launch kernels. hip_runtime.h can be compiled using a standard C++ compiler but will expose a subset of the available functions.
-
-CUDA has slightly different contents for these two files. In some cases you may need to convert hipified code to include the richer `hip_runtime.h` instead of `hip_runtime_api.h`.
-
-### Using a Standard C++ Compiler
-
-You can compile `hip_runtime_api.h` using a standard C or C++ compiler (e.g., gcc or ICC). The HIP include paths and defines (`__HIP_PLATFORM_AMD__` or `__HIP_PLATFORM_NVIDIA__`) must pass to the standard compiler; `hipconfig` then returns the necessary options:
-
-```bash
-> hipconfig --cxx_config
- -D__HIP_PLATFORM_AMD__ -I/home/user1/hip/include
-```
-
-You can capture the `hipconfig` output and passed it to the standard compiler; below is a sample makefile syntax:
-
-```bash
-CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
-```
-
-NVCC includes some headers by default. However, HIP does not include default headers, and instead all required files must be explicitly included.
-Specifically, files that call HIP run-time APIs or define HIP kernels must explicitly include the appropriate HIP headers.
-If the compilation process reports that it cannot find necessary APIs (for example, `error: identifier hipSetDevice is undefined`),
-ensure that the file includes hip_runtime.h (or hip_runtime_api.h, if appropriate).
-The hipify-perl script automatically converts `cuda_runtime.h` to `hip_runtime.h`, and it converts `cuda_runtime_api.h` to `hip_runtime_api.h`, but it may miss nested headers or macros.
-
-#### `cuda.h`
-
-The HIP-Clang path provides an empty `cuda.h` file. Some existing CUDA programs include this file but don't require any of the functions.
-
-### Choosing HIP File Extensions
-
-Many existing CUDA projects use the `.cu` and `.cuh` file extensions to indicate code that should be run through the NVCC compiler.
-For quick HIP ports, leaving these file extensions unchanged is often easier, as it minimizes the work required to change file names in the directory and #include statements in the files.
-
-For new projects or ports which can be re-factored, we recommend the use of the extension `.hip.cpp` for source files, and
-`.hip.h` or `.hip.hpp` for header files.
-This indicates that the code is standard C++ code, but also provides a unique indication for make tools to
-run hipcc when appropriate.
-
-## Workarounds
-
-### ``warpSize``
-
-Code should not assume a warp size of 32 or 64. See the
-:ref:`HIP language extension for warpSize ` for information on how
-to write portable wave-aware code.
-
-### Kernel launch with group size > 256
-
-Kernel code should use `__attribute__((amdgpu_flat_work_group_size(,)))`.
-
-For example:
-
-```cpp
-__global__ void dot(double *a,double *b,const int n) __attribute__((amdgpu_flat_work_group_size(1, 512)))
-```
-
-## `memcpyToSymbol`
-
-HIP support for `hipMemcpyToSymbol` is complete. This feature allows a kernel
-to define a device-side data symbol which can be accessed on the host side. The symbol
-can be in __constant or device space.
-
-Note that the symbol name needs to be encased in the HIP_SYMBOL macro, as shown in the code example below. This also applies to `hipMemcpyFromSymbol`, `hipGetSymbolAddress`, and `hipGetSymbolSize`.
-
-For example:
-
-Device Code:
-
-```cpp
-#include
-#include
-#include
-
-#define HIP_ASSERT(status) \
- assert(status == hipSuccess)
-
-#define LEN 512
-#define SIZE 2048
-
-__constant__ int Value[LEN];
-
-__global__ void Get(int *Ad)
-{
- int tid = threadIdx.x + blockIdx.x * blockDim.x;
- Ad[tid] = Value[tid];
-}
-
-int main()
-{
- int *A, *B, *Ad;
- A = new int[LEN];
- B = new int[LEN];
- for(unsigned i=0;i(&ptr), sizeof(double));
-hipPointerAttribute_t attr;
-hipPointerGetAttributes(&attr, ptr); /*attr.type will have value as hipMemoryTypeDevice*/
-
-double* ptrHost;
-hipHostMalloc(&ptrHost, sizeof(double));
-hipPointerAttribute_t attr;
-hipPointerGetAttributes(&attr, ptrHost); /*attr.type will have value as hipMemoryTypeHost*/
-```
-
-Please note, `hipMemoryType` enum values are different from `cudaMemoryType` enum values.
-
-For example, on AMD platform, `hipMemoryType` is defined in `hip_runtime_api.h`,
-
-```cpp
-typedef enum hipMemoryType {
- hipMemoryTypeHost = 0, ///< Memory is physically located on host
- hipMemoryTypeDevice = 1, ///< Memory is physically located on device. (see deviceId for specific device)
- hipMemoryTypeArray = 2, ///< Array memory, physically located on device. (see deviceId for specific device)
- hipMemoryTypeUnified = 3, ///< Not used currently
- hipMemoryTypeManaged = 4 ///< Managed memory, automaticallly managed by the unified memory system
-} hipMemoryType;
-```
-
-Looking into CUDA toolkit, it defines `cudaMemoryType` as following,
-
-```cpp
-enum cudaMemoryType
-{
- cudaMemoryTypeUnregistered = 0, // Unregistered memory.
- cudaMemoryTypeHost = 1, // Host memory.
- cudaMemoryTypeDevice = 2, // Device memory.
- cudaMemoryTypeManaged = 3, // Managed memory
-}
-```
-
-In this case, memory type translation for `hipPointerGetAttributes` needs to be handled properly on NVIDIA platform to get the correct memory type in CUDA, which is done in the file `nvidia_hip_runtime_api.h`.
-
-So in any HIP applications which use HIP APIs involving memory types, developers should use `#ifdef` in order to assign the correct enum values depending on NVIDIA or AMD platform.
-
-As an example, please see the code from the [link](https://github.com/ROCm/hip-tests/tree/develop/catch/unit/memory/hipMemcpyParam2D.cc).
-
-With the `#ifdef` condition, HIP APIs work as expected on both AMD and NVIDIA platforms.
-
-Note, `cudaMemoryTypeUnregstered` is currently not supported in `hipMemoryType` enum, due to HIP functionality backward compatibility.
-
-## `threadfence_system`
-
-`threadfence_system` makes all device memory writes, all writes to mapped host memory, and all writes to peer memory visible to CPU and other GPU devices.
-Some implementations can provide this behavior by flushing the GPU L2 cache.
-HIP/HIP-Clang does not provide this functionality. As a workaround, users can set the environment variable `HSA_DISABLE_CACHE=1` to disable the GPU L2 cache. This will affect all accesses and for all kernels and so may have a performance impact.
-
-### Textures and Cache Control
-
-Compute programs sometimes use textures either to access dedicated texture caches or to use the texture-sampling hardware for interpolation and clamping. The former approach uses simple point samplers with linear interpolation, essentially only reading a single point. The latter approach uses the sampler hardware to interpolate and combine multiple samples. AMD hardware, as well as recent competing hardware, has a unified texture/L1 cache, so it no longer has a dedicated texture cache. But the NVCC path often caches global loads in the L2 cache, and some programs may benefit from explicit control of the L1 cache contents. We recommend the `__ldg` instruction for this purpose.
-
-AMD compilers currently load all data into both the L1 and L2 caches, so `__ldg` is treated as a no-op.
-
-We recommend the following for functional portability:
-
-* For programs that use textures only to benefit from improved caching, use the `__ldg` instruction
-* Programs that use texture object and reference APIs, work well on HIP
-
-## More Tips
-
-### HIP Logging
-
-On an AMD platform, set the AMD_LOG_LEVEL environment variable to log HIP application execution information.
-
-The value of the setting controls different logging level,
-
-```cpp
-enum LogLevel {
-LOG_NONE = 0,
-LOG_ERROR = 1,
-LOG_WARNING = 2,
-LOG_INFO = 3,
-LOG_DEBUG = 4
-};
-```
-
-Logging mask is used to print types of functionalities during the execution of HIP application.
-It can be set as one of the following values,
-
-```cpp
-enum LogMask {
- LOG_API = 1, //!< (0x1) API call
- LOG_CMD = 2, //!< (0x2) Kernel and Copy Commands and Barriers
- LOG_WAIT = 4, //!< (0x4) Synchronization and waiting for commands to finish
- LOG_AQL = 8, //!< (0x8) Decode and display AQL packets
- LOG_QUEUE = 16, //!< (0x10) Queue commands and queue contents
- LOG_SIG = 32, //!< (0x20) Signal creation, allocation, pool
- LOG_LOCK = 64, //!< (0x40) Locks and thread-safety code.
- LOG_KERN = 128, //!< (0x80) Kernel creations and arguments, etc.
- LOG_COPY = 256, //!< (0x100) Copy debug
- LOG_COPY2 = 512, //!< (0x200) Detailed copy debug
- LOG_RESOURCE = 1024, //!< (0x400) Resource allocation, performance-impacting events.
- LOG_INIT = 2048, //!< (0x800) Initialization and shutdown
- LOG_MISC = 4096, //!< (0x1000) Misc debug, not yet classified
- LOG_AQL2 = 8192, //!< (0x2000) Show raw bytes of AQL packet
- LOG_CODE = 16384, //!< (0x4000) Show code creation debug
- LOG_CMD2 = 32768, //!< (0x8000) More detailed command info, including barrier commands
- LOG_LOCATION = 65536, //!< (0x10000) Log message location
- LOG_MEM = 131072, //!< (0x20000) Memory allocation
- LOG_MEM_POOL = 262144, //!< (0x40000) Memory pool allocation, including memory in graphs
- LOG_ALWAYS = -1 //!< (0xFFFFFFFF) Log always even mask flag is zero
-};
-```
-
-### Debugging hipcc
-
-To see the detailed commands that hipcc issues, set the environment variable HIPCC_VERBOSE to 1. Doing so will print to ``stderr`` the HIP-clang (or NVCC) commands that hipcc generates.
-
-```bash
-export HIPCC_VERBOSE=1
-make
-...
-hipcc-cmd: /opt/rocm/bin/hipcc --offload-arch=native -x hip backprop_cuda.cu
-```
-
-### Editor Highlighting
-
-See the utils/vim or utils/gedit directories to add handy highlighting to hip files.
diff --git a/docs/how-to/hip_porting_guide.rst b/docs/how-to/hip_porting_guide.rst
new file mode 100644
index 0000000000..8e2b0f2c5e
--- /dev/null
+++ b/docs/how-to/hip_porting_guide.rst
@@ -0,0 +1,713 @@
+.. meta::
+ :description: This chapter presents how to port CUDA source code to HIP.
+ :keywords: AMD, ROCm, HIP, CUDA, porting, port
+
+********************************************************************************
+HIP porting guide
+********************************************************************************
+
+HIP is designed to ease the porting of existing CUDA code into the HIP
+environment. This page describes the available tools and provides practical
+suggestions on how to port CUDA code and work through common issues.
+
+Porting a CUDA Project
+================================================================================
+
+Mixing HIP and CUDA code results in valid CUDA code. This enables users to
+incrementally port CUDA to HIP, and still compile and test the code during the
+transition.
+
+The only notable exception is ``hipError_t``, which is not just an alias to
+``cudaError_t``. In these cases HIP provides functions to convert between the
+error code spaces:
+
+* :cpp:func:`hipErrorToCudaError`
+* :cpp:func:`hipErrorToCUResult`
+* :cpp:func:`hipCUDAErrorTohipError`
+* :cpp:func:`hipCUResultTohipError`
+
+General Tips
+--------------------------------------------------------------------------------
+
+* Starting to port on an NVIDIA machine is often the easiest approach, as the
+ code can be tested for functionality and performance even if not fully ported
+ to HIP.
+* Once the CUDA code is ported to HIP and is running on the CUDA machine,
+ compile the HIP code for an AMD machine.
+* You can handle platform-specific features through conditional compilation or
+ by adding them to the open-source HIP infrastructure.
+* Use the `HIPIFY `_ tools to automatically
+ convert CUDA code to HIP, as described in the following section.
+
+HIPIFY
+--------------------------------------------------------------------------------
+
+:doc:`HIPIFY ` is a collection of tools that automatically
+translate CUDA to HIP code. There are two flavours available, ``hipfiy-clang``
+and ``hipify-perl``.
+
+:doc:`hipify-clang ` is, as the name implies, a Clang-based
+tool, and actually parses the code, translates it into an Abstract Syntax Tree,
+from which it then generates the HIP source. For this, ``hipify-clang`` needs to
+be able to actually compile the code, so the CUDA code needs to be correct, and
+a CUDA install with all necessary headers must be provided.
+
+:doc:`hipify-perl ` uses pattern matching, to translate the
+CUDA code to HIP. It does not require a working CUDA installation, and can also
+convert CUDA code, that is not syntactically correct. It is therefore easier to
+set up and use, but is not as powerful as ``hipfiy-clang``.
+
+Scanning existing CUDA code to scope the porting effort
+--------------------------------------------------------------------------------
+
+The ``--examine`` option, supported by the clang and perl version, tells hipify
+to do a test-run, without changing the files, but instead scan CUDA code to
+determine which files contain CUDA code and how much of that code can
+automatically be hipified.
+
+There also are ``hipexamine-perl.sh`` or ``hipexamine.sh`` (for
+``hipify-clang``) scripts to automatically scan directories.
+
+For example, the following is a scan of one of the
+`cuda-samples `_:
+
+.. code-block:: shell
+
+ > cd Samples/2_Concepts_and_Techniques/convolutionSeparable/
+ > hipexamine-perl.sh
+ [HIPIFY] info: file './convolutionSeparable.cu' statistics:
+ CONVERTED refs count: 2
+ TOTAL lines of code: 214
+ WARNINGS: 0
+ [HIPIFY] info: CONVERTED refs by names:
+ cooperative_groups.h => hip/hip_cooperative_groups.h: 1
+ cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
+
+ [HIPIFY] info: file './main.cpp' statistics:
+ CONVERTED refs count: 13
+ TOTAL lines of code: 174
+ WARNINGS: 0
+ [HIPIFY] info: CONVERTED refs by names:
+ cudaDeviceSynchronize => hipDeviceSynchronize: 2
+ cudaFree => hipFree: 3
+ cudaMalloc => hipMalloc: 3
+ cudaMemcpy => hipMemcpy: 2
+ cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
+ cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
+ cuda_runtime.h => hip/hip_runtime.h: 1
+
+ [HIPIFY] info: file 'GLOBAL' statistics:
+ CONVERTED refs count: 15
+ TOTAL lines of code: 512
+ WARNINGS: 0
+ [HIPIFY] info: CONVERTED refs by names:
+ cooperative_groups.h => hip/hip_cooperative_groups.h: 1
+ cudaDeviceSynchronize => hipDeviceSynchronize: 2
+ cudaFree => hipFree: 3
+ cudaMalloc => hipMalloc: 3
+ cudaMemcpy => hipMemcpy: 2
+ cudaMemcpyDeviceToHost => hipMemcpyDeviceToHost: 1
+ cudaMemcpyHostToDevice => hipMemcpyHostToDevice: 1
+ cudaMemcpyToSymbol => hipMemcpyToSymbol: 1
+ cuda_runtime.h => hip/hip_runtime.h: 1
+
+``hipexamine-perl.sh`` reports how many CUDA calls are going to be converted to
+HIP (e.g. ``CONVERTED refs count: 2``), and lists them by name together with
+their corresponding HIP-version (see the lines following ``[HIPIFY] info:
+CONVERTED refs by names:``). It also lists the total lines of code for the file
+and potential warnings. In the end it prints a summary for all files.
+
+Automatically converting a CUDA project
+--------------------------------------------------------------------------------
+
+To directly replace the files, the ``--inplace`` option of ``hipify-perl`` or
+``hipify-clang`` can be used. This creates a backup of the original files in a
+``.prehip`` file and overwrites the existing files, keeping their file
+endings. If the ``--inplace`` option is not given, the scripts print the
+hipified code to ``stdout``.
+
+``hipconvertinplace.sh``or ``hipconvertinplace-perl.sh`` operate on whole
+directories.
+
+Library Equivalents
+--------------------------------------------------------------------------------
+
+ROCm provides libraries to ease porting of code relying on CUDA libraries.
+Most CUDA libraries have a corresponding HIP library.
+
+There are two flavours of libraries provided by ROCm, ones prefixed with ``hip``
+and ones prefixed with ``roc``. While both are written using HIP, in general
+only the ``hip``-libraries are portable. The libraries with the ``roc``-prefix
+might also run on CUDA-capable GPUs, however they have been optimized for AMD
+GPUs and might use assembly code or a different API, to achieve the best
+performance.
+
+.. note::
+
+ If the application is only required to run on AMD GPUs, it is recommended to
+ use the ``roc``-libraries.
+
+In the case where a library provides a ``roc``- and a ``hip``- version, the
+``hip`` version is a marshalling library, which is just a thin layer that is
+redirecting the function calls to either the ``roc``-library or the
+corresponding CUDA library, depending on the platform, to provide compatibility.
+
+.. list-table::
+ :header-rows: 1
+
+ *
+ - CUDA Library
+ - ``hip`` Library
+ - ``roc`` Library
+ - Comment
+ *
+ - cuBLAS
+ - `hipBLAS `_
+ - `rocBLAS `_
+ - Basic Linear Algebra Subroutines
+ *
+ - cuBLASLt
+ - `hipBLASLt `_
+ -
+ - Linear Algebra Subroutines, lightweight and new flexible API
+ *
+ - cuFFT
+ - `hipFFT `_
+ - `rocFFT `_
+ - Fast Fourier Transfer Library
+ *
+ - cuSPARSE
+ - `hipSPARSE `_
+ - `rocSPARSE `_
+ - Sparse BLAS + SPMV
+ *
+ - cuSOLVER
+ - `hipSOLVER `_
+ - `rocSOLVER `_
+ - Lapack library
+ *
+ - AmgX
+ -
+ - `rocALUTION `_
+ - Sparse iterative solvers and preconditioners with algebraic multigrid
+ *
+ - Thrust
+ -
+ - `rocThrust `_
+ - C++ parallel algorithms library
+ *
+ - CUB
+ - `hipCUB `_
+ - `rocPRIM `_
+ - Low Level Optimized Parallel Primitives
+ *
+ - cuDNN
+ -
+ - `MIOpen `_
+ - Deep learning Solver Library
+ *
+ - cuRAND
+ - `hipRAND `_
+ - `rocRAND `_
+ - Random Number Generator Library
+ *
+ - NCCL
+ -
+ - `RCCL `_
+ - Communications Primitives Library based on the MPI equivalents
+ RCCL is a drop-in replacement for NCCL
+
+Distinguishing compilers and platforms
+================================================================================
+
+Identifying the HIP Target Platform
+--------------------------------------------------------------------------------
+
+HIP projects can target either the AMD or NVIDIA platform. The platform affects
+which backend-headers are included and which libraries are used for linking. The
+created binaries are not portable between AMD and NVIDIA platforms.
+
+To write code that is specific to a platform the C++-macros specified in the
+following section can be used.
+
+Compiler Defines: Summary
+--------------------------------------------------------------------------------
+
+This section lists macros that are defined by compilers and the HIP/CUDA APIs,
+and what compiler/platform combinations they are defined for.
+
+The following table lists the macros that can be used when compiling HIP. Most
+of these macros are not directly defined by the compilers, but in
+``hip_common.h``, which is included by ``hip_runtime.h``.
+
+.. list-table:: HIP-related defines
+ :header-rows: 1
+
+ *
+ - Macro
+ - ``amdclang++``
+ - ``nvcc`` when used as backend for ``hipcc``
+ - Other (GCC, ICC, Clang, etc.)
+ *
+ - ``__HIP_PLATFORM_AMD__``
+ - Defined
+ - Undefined
+ - Undefined, needs to be set explicitly
+ *
+ - ``__HIP_PLATFORM_NVIDIA__``
+ - Undefined
+ - Defined
+ - Undefined, needs to be set explicitly
+ *
+ - ``__HIPCC__``
+ - Defined when compiling ``.hip`` files or specifying ``-x hip``
+ - Defined when compiling ``.hip`` files or specifying ``-x hip``
+ - Undefined
+ *
+ - ``__HIP_DEVICE_COMPILE__``
+ - 1 if compiling for device
+ undefined if compiling for host
+ - 1 if compiling for device
+ undefined if compiling for host
+ - Undefined
+ *
+ - ``__HIP_ARCH___``
+ - 0 or 1 depending on feature support of targeted hardware (see :ref:`identifying_device_architecture_features`)
+ - 0 or 1 depending on feature support of targeted hardware
+ - 0
+ *
+ - ``__HIP__``
+ - Defined when compiling ``.hip`` files or specifying ``-x hip``
+ - Undefined
+ - Undefined
+
+The following table lists macros related to ``nvcc`` and CUDA as HIP backend.
+
+.. list-table:: NVCC-related defines
+ :header-rows: 1
+
+ *
+ - Macro
+ - ``amdclang++``
+ - ``nvcc`` when used as backend for ``hipcc``
+ - Other (GCC, ICC, Clang, etc.)
+ *
+ - ``__CUDACC__``
+ - Undefined
+ - Defined
+ - Undefined
+ (Clang defines this when explicitly compiling CUDA code)
+ *
+ - ``__NVCC__``
+ - Undefined
+ - Defined
+ - Undefined
+ *
+ - ``__CUDA_ARCH__`` [#cuda_arch]_
+ - Undefined
+ - Defined in device code
+ Integer representing compute capability
+ Must not be used in host code
+ - Undefined
+
+.. [#cuda_arch] the use of ``__CUDA_ARCH__`` to check for hardware features is
+ discouraged, as this is not portable. Use the ``__HIP_ARCH_HAS_``
+ macros instead.
+
+Identifying the compilation target platform
+--------------------------------------------------------------------------------
+
+Despite HIP's portability, it can be necessary to tailor code to a specific
+platform, in order to provide platform-specific code, or aid in
+platform-specific performance improvements.
+
+For this, the ``__HIP_PLATFORM_AMD__`` and ``__HIP_PLATFORM_NVIDIA__`` macros
+can be used, e.g.:
+
+.. code-block:: cpp
+
+ #ifdef __HIP_PLATFORM_AMD__
+ // This code path is compiled when amdclang++ is used for compilation
+ #endif
+
+.. code-block:: cpp
+
+ #ifdef __HIP_PLATFORM_NVIDIA__
+ // This code path is compiled when nvcc is used for compilation
+ // Could be compiling with CUDA language extensions enabled (for example, a ".cu file)
+ // Could be in pass-through mode to an underlying host compiler (for example, a .cpp file)
+ #endif
+
+When using ``hipcc``, the environment variable ``HIP_PLATFORM`` specifies the
+runtime to use. When an AMD graphics driver and an AMD GPU is detected,
+``HIP_PLATFORM`` is set to ``amd``. If both runtimes are installed, and a
+specific one should be used, or ``hipcc`` can't detect the runtime, the
+environment variable has to be set manually.
+
+To explicitly use the CUDA compilation path, use:
+
+.. code-block:: bash
+
+ export HIP_PLATFORM=nvidia
+ hipcc main.cpp
+
+Identifying Host or Device Compilation Pass
+--------------------------------------------------------------------------------
+
+``amdclang++`` makes multiple passes over the code: one for the host code, and
+one each for the device code for every GPU architecture to be compiled for.
+``nvcc`` makes two passes over the code: one for host code and one for device
+code.
+
+The ``__HIP_DEVICE_COMPILE__``-macro is defined when the compiler is compiling
+for the device.
+
+
+``__HIP_DEVICE_COMPILE__`` is a portable check that can replace the
+``__CUDA_ARCH__``.
+
+.. code-block:: cpp
+
+ #include "hip/hip_runtime.h"
+ #include
+
+ __host__ __device__ void call_func(){
+ #ifdef __HIP_DEVICE_COMPILE__
+ printf("device\n");
+ #else
+ std::cout << "host" << std::endl;
+ #endif
+ }
+
+ __global__ void test_kernel(){
+ call_func();
+ }
+
+ int main(int argc, char** argv) {
+ test_kernel<<<1, 1, 0, 0>>>();
+
+ call_func();
+ }
+
+.. _identifying_device_architecture_features:
+
+Identifying Device Architecture Features
+================================================================================
+
+GPUs of different generations and architectures do not all provide the same
+level of :doc:`hardware feature support <../reference/hardware_features>`. To
+guard device-code using these architecture dependent features, the
+``__HIP_ARCH___`` C++-macros can be used.
+
+Device Code Feature Identification
+--------------------------------------------------------------------------------
+
+Some CUDA code tests ``__CUDA_ARCH__`` for a specific value to determine whether
+the GPU supports a certain architectural feature, depending on its compute
+capability. This requires knowledge about what ``__CUDA_ARCH__`` supports what
+feature set.
+
+HIP simplifies this, by replacing these macros with feature-specific macros, not
+architecture specific.
+
+For instance,
+
+.. code-block:: cpp
+
+ //#if __CUDA_ARCH__ >= 130 // does not properly specify, what feature is required, not portable
+ #if __HIP_ARCH_HAS_DOUBLES__ == 1 // explicitly specifies, what feature is required, portable between AMD and NVIDIA GPUs
+ // device code
+ #endif
+
+For host code, the ``__HIP_ARCH___`` defines are set to 0, if
+``hip_runtime.h`` is included, and undefined otherwise. It should not be relied
+upon in host code.
+
+Host Code Feature Identification
+--------------------------------------------------------------------------------
+
+Host code must not rely on the ``__HIP_ARCH___`` macros, as the GPUs
+available to a system can not be known during compile time, and their
+architectural features differ.
+
+Host code can query architecture feature flags during runtime, by using
+:cpp:func:`hipGetDeviceProperties` or :cpp:func:`hipDeviceGetAttribute`.
+
+.. code-block:: cpp
+
+ #include
+ #include
+ #include
+
+ #define HIP_CHECK(expression) { \
+ const hipError_t err = expression; \
+ if (err != hipSuccess){ \
+ std::cout << "HIP Error: " << hipGetErrorString(err)) \
+ << " at line " << __LINE__ << std::endl; \
+ std::exit(EXIT_FAILURE); \
+ } \
+ }
+
+ int main(){
+ int deviceCount;
+ HIP_CHECK(hipGetDeviceCount(&deviceCount));
+
+ int device = 0; // Query first available GPU. Can be replaced with any
+ // integer up to, not including, deviceCount
+ hipDeviceProp_t deviceProp;
+ HIP_CHECK(hipGetDeviceProperties(&deviceProp, device));
+
+ std::cout << "The queried device ";
+ if (deviceProp.arch.hasSharedInt32Atomics) // portable HIP feature query
+ std::cout << "supports";
+ else
+ std::cout << "does not support";
+ std::cout << " shared int32 atomic operations" << std::endl;
+ }
+
+Table of Architecture Properties
+--------------------------------------------------------------------------------
+
+The table below shows the full set of architectural properties that HIP
+supports, together with the corresponding macros and device properties.
+
+.. list-table::
+ :header-rows: 1
+
+ *
+ - Macro (for device code)
+ - Device Property (host runtime query)
+ - Comment
+ *
+ - ``__HIP_ARCH_HAS_GLOBAL_INT32_ATOMICS__``
+ - ``hasGlobalInt32Atomics``
+ - 32-bit integer atomics for global memory
+ *
+ - ``__HIP_ARCH_HAS_GLOBAL_FLOAT_ATOMIC_EXCH__``
+ - ``hasGlobalFloatAtomicExch``
+ - 32-bit float atomic exchange for global memory
+ *
+ - ``__HIP_ARCH_HAS_SHARED_INT32_ATOMICS__``
+ - ``hasSharedInt32Atomics``
+ - 32-bit integer atomics for shared memory
+ *
+ - ``__HIP_ARCH_HAS_SHARED_FLOAT_ATOMIC_EXCH__``
+ - ``hasSharedFloatAtomicExch``
+ - 32-bit float atomic exchange for shared memory
+ *
+ - ``__HIP_ARCH_HAS_FLOAT_ATOMIC_ADD__``
+ - ``hasFloatAtomicAdd``
+ - 32-bit float atomic add in global and shared memory
+ *
+ - ``__HIP_ARCH_HAS_GLOBAL_INT64_ATOMICS__``
+ - ``hasGlobalInt64Atomics``
+ - 64-bit integer atomics for global memory
+ *
+ - ``__HIP_ARCH_HAS_SHARED_INT64_ATOMICS__``
+ - ``hasSharedInt64Atomics``
+ - 64-bit integer atomics for shared memory
+ *
+ - ``__HIP_ARCH_HAS_DOUBLES__``
+ - ``hasDoubles``
+ - Double-precision floating-point operations
+ *
+ - ``__HIP_ARCH_HAS_WARP_VOTE__``
+ - ``hasWarpVote``
+ - Warp vote instructions (``any``, ``all``)
+ *
+ - ``__HIP_ARCH_HAS_WARP_BALLOT__``
+ - ``hasWarpBallot``
+ - Warp ballot instructions
+ *
+ - ``__HIP_ARCH_HAS_WARP_SHUFFLE__``
+ - ``hasWarpShuffle``
+ - Warp shuffle operations (``shfl_*``)
+ *
+ - ``__HIP_ARCH_HAS_WARP_FUNNEL_SHIFT__``
+ - ``hasFunnelShift``
+ - Funnel shift two input words into one
+ *
+ - ``__HIP_ARCH_HAS_THREAD_FENCE_SYSTEM__``
+ - ``hasThreadFenceSystem``
+ - :cpp:func:`threadfence_system`
+ *
+ - ``__HIP_ARCH_HAS_SYNC_THREAD_EXT__``
+ - ``hasSyncThreadsExt``
+ - :cpp:func:`syncthreads_count`, :cpp:func:`syncthreads_and`, :cpp:func:`syncthreads_or`
+ *
+ - ``__HIP_ARCH_HAS_SURFACE_FUNCS__``
+ - ``hasSurfaceFuncs``
+ - Supports :ref:`surface functions `.
+ *
+ - ``__HIP_ARCH_HAS_3DGRID__``
+ - ``has3dGrid``
+ - Grids and groups are 3D
+ *
+ - ``__HIP_ARCH_HAS_DYNAMIC_PARALLEL__``
+ - ``hasDynamicParallelism``
+ - Ability to launch a kernel from within a kernel
+
+Compilation
+================================================================================
+
+``hipcc`` is a portable compiler driver that calls ``nvcc`` or ``amdclang++``
+and forwards the appropriate options. It passes options through
+to the target compiler. Tools that call ``hipcc`` must ensure the compiler
+options are appropriate for the target compiler.
+
+``hipconfig`` is a helpful tool in identifying the current systems platform,
+compiler and runtime. It can also help set options appropriately.
+
+As an example, it can provide a path to HIP, in Makefiles for example:
+
+.. code-block:: shell
+
+ HIP_PATH ?= $(shell hipconfig --path)
+
+HIP Headers
+--------------------------------------------------------------------------------
+
+The ``hip_runtime.h`` headers define all the necessary types, functions, macros,
+etc., needed to compile a HIP program, this includes host as well as device
+code. ``hip_runtime_api.h`` is a subset of ``hip_runtime.h``.
+
+CUDA has slightly different contents for these two files. In some cases you may
+need to convert hipified code to include the richer ``hip_runtime.h`` instead of
+``hip_runtime_api.h``.
+
+Using a Standard C++ Compiler
+--------------------------------------------------------------------------------
+
+You can compile ``hip_runtime_api.h`` using a standard C or C++ compiler
+(e.g., ``gcc`` or ``icc``).
+A source file that is only calling HIP APIs but neither defines nor launches any
+kernels can be compiled with a standard host compiler (e.g. ``gcc`` or ``icc``)
+even when ``hip_runtime_api.h`` or ``hip_runtime.h`` are included.
+
+The HIP include paths and platform macros (``__HIP_PLATFORM_AMD__`` or
+``__HIP_PLATFORM_NVIDIA__``) must be passed to the compiler.
+
+``hipconfig`` can help in finding the necessary options, for example on an AMD
+platform:
+
+.. code-block:: bash
+
+ hipconfig --cpp_config
+ -D__HIP_PLATFORM_AMD__= -I/opt/rocm/include
+
+``nvcc`` includes some headers by default. ``hipcc`` does not include
+default headers, and instead all required files must be explicitly included.
+
+The ``hipify`` tool automatically converts ``cuda_runtime.h`` to
+``hip_runtime.h``, and it converts ``cuda_runtime_api.h`` to
+``hip_runtime_api.h``, but it may miss nested headers or macros.
+
+warpSize
+================================================================================
+
+Code should not assume a warp size of 32 or 64, as that is not portable between
+platforms and architectures. The ``warpSize`` built-in should be used in device
+code, while the host can query it during runtime via the device properties. See
+the :ref:`HIP language extension for warpSize ` for information on
+how to write portable wave-aware code.
+
+Lane masks bit-shift
+================================================================================
+
+A thread in a warp is also called a lane, and a lane mask is a bitmask where
+each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
+0 if it's inactive. Bit-shift operations are typically used to create lane masks
+and on AMD GPUs the ``warpSize`` can differ between different architectures,
+that's why it's essential to use correct bitmask type, when porting code.
+
+Example:
+
+.. code-block:: cpp
+
+ // Get the thread's position in the warp
+ unsigned int laneId = threadIdx.x % warpSize;
+
+ // Use lane ID for bit-shift
+ val & ((1 << (threadIdx.x % warpSize) )-1 );
+
+ // Shift 32 bit integer with val variable
+ WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
+
+Lane masks are 32-bit integer types as this is the integer precision that C
+assigns to such constants by default. GCN/CDNA architectures have a warp size of
+64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain
+values greater than 31. Consequently, shifting by such values would clear the
+32-bit register to which the shift operation is applied. For AMD
+architectures, a straightforward fix could look as follows:
+
+.. code-block:: cpp
+
+ // Get the thread's position in the warp
+ unsigned int laneId = threadIdx.x % warpSize;
+
+ // Use lane ID for bit-shift
+ val & ((1ull << (threadIdx.x % warpSize) )-1 );
+
+ // Shift 64 bit integer with val variable
+ WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
+
+For portability reasons, it is better to introduce appropriately
+typed placeholders as shown below:
+
+.. code-block:: cpp
+
+ #if defined(__GFX8__) || defined(__GFX9__)
+ typedef uint64_t lane_mask_t;
+ #else
+ typedef uint32_t lane_mask_t;
+ #endif
+
+The use of :code:`lane_mask_t` with the previous example:
+
+.. code-block:: cpp
+
+ // Get the thread's position in the warp
+ unsigned int laneId = threadIdx.x % warpSize;
+
+ // Use lane ID for bit-shift
+ val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
+
+ // Shift 32 or 64 bit integer with val variable
+ WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
+
+Porting from CUDA __launch_bounds__
+================================================================================
+
+CUDA also defines a ``__launch_bounds__`` qualifier which works similar to HIP's
+implementation, however it uses different parameters:
+
+.. code-block:: cpp
+
+ __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MULTIPROCESSOR)
+
+The first parameter is the same as HIP's implementation, but
+``MIN_BLOCKS_PER_MULTIPROCESSOR`` must be converted to
+``MIN_WARPS_PER_EXECUTION``, which uses warps and execution units rather than
+blocks and multiprocessors. This conversion is performed automatically by
+:doc:`HIPIFY `, or can be done manually with the following
+equation.
+
+.. code-block:: cpp
+
+ MIN_WARPS_PER_EXECUTION_UNIT = (MIN_BLOCKS_PER_MULTIPROCESSOR * MAX_THREADS_PER_BLOCK) / warpSize
+
+Directly controlling the warps per execution unit makes it easier to reason
+about the occupancy, unlike with blocks, where the occupancy depends on the
+block size.
+
+The use of execution units rather than multiprocessors also provides support for
+architectures with multiple execution units per multiprocessor. For example, the
+AMD GCN architecture has 4 execution units per multiprocessor.
+
+maxregcount
+--------------------------------------------------------------------------------
+
+Unlike ``nvcc``, ``amdclang++`` does not support the ``--maxregcount`` option.
+Instead, users are encouraged to use the ``__launch_bounds__`` directive since
+the parameters are more intuitive and portable than micro-architecture details
+like registers. The directive allows per-kernel control.
diff --git a/docs/how-to/hip_rtc.md b/docs/how-to/hip_rtc.md
deleted file mode 100644
index 14584828be..0000000000
--- a/docs/how-to/hip_rtc.md
+++ /dev/null
@@ -1,535 +0,0 @@
-
-
-
-
-
-
-# Programming for HIP runtime compiler (RTC)
-
-HIP lets you compile kernels at runtime with the `hiprtc*` APIs.
-Kernels can be stored as a text string and can be passed to HIPRTC APIs alongside options to guide the compilation.
-
-:::{note}
-
-* This library can be used on systems without HIP installed nor AMD GPU driver installed at all (offline compilation). Therefore, it doesn't depend on any HIP runtime library.
-* This library depends on Code Object Manager (comgr). You can try to statically link comgr into HIPRTC to avoid ambiguity.
-* Developers can bundle this library with their application.
-
-:::
-
-## Compilation APIs
-
-To use HIPRTC functionality, HIPRTC header needs to be included first.
-`#include `
-
-Kernels can be stored in a string:
-
-```cpp
-static constexpr auto kernel_source {
-R"(
- extern "C"
- __global__ void vector_add(float* output, float* input1, float* input2, size_t size) {
- int i = threadIdx.x;
- if (i < size) {
- output[i] = input1[i] + input2[i];
- }
- }
-)"};
-```
-
-Now to compile this kernel, it needs to be associated with `hiprtcProgram` type, which is done by declaring `hiprtcProgram prog;` and associating the string of kernel with this program:
-
-```cpp
-hiprtcCreateProgram(&prog, // HIPRTC program handle
- kernel_source, // HIP kernel source string
- "vector_add.cpp", // Name of the HIP program, can be null or an empty string
- 0, // Number of headers
- NULL, // Header sources
- NULL); // Name of header files
-```
-
-`hiprtcCreateProgram` API also allows you to add headers which can be included in your RTC program.
-For online compilation, the compiler pre-defines HIP device API functions, HIP specific types and macros for device compilation, but does not include standard C/C++ headers by default. Users can only include header files provided to `hiprtcCreateProgram`.
-
-After associating the kernel string with `hiprtcProgram`, you can now compile this program using:
-
-```cpp
-hiprtcCompileProgram(prog, // hiprtcProgram
- 0, // Number of options
- options); // Clang Options [Supported Clang Options](clang_options.md)
-```
-
-`hiprtcCompileProgram` returns a status value which can be converted to string via `hiprtcGetErrorString`. If compilation is successful, `hiprtcCompileProgram` will return `HIPRTC_SUCCESS`.
-
-If the compilation fails, you can look up the logs via:
-
-```cpp
-size_t logSize;
-hiprtcGetProgramLogSize(prog, &logSize);
-
-if (logSize) {
- string log(logSize, '\0');
- hiprtcGetProgramLog(prog, &log[0]);
- // Corrective action with logs
-}
-```
-
-If the compilation is successful, you can load the compiled binary in a local variable.
-
-```cpp
-size_t codeSize;
-hiprtcGetCodeSize(prog, &codeSize);
-
-vector kernel_binary(codeSize);
-hiprtcGetCode(prog, kernel_binary.data());
-```
-
-After loading the binary, `hiprtcProgram` can be destroyed.
-`hiprtcDestroyProgram(&prog);`
-
-The binary present in `kernel_binary` can now be loaded via `hipModuleLoadData` API.
-
-```cpp
-hipModule_t module;
-hipFunction_t kernel;
-
-hipModuleLoadData(&module, kernel_binary.data());
-hipModuleGetFunction(&kernel, module, "vector_add");
-```
-
-And now this kernel can be launched via `hipModule` APIs.
-
-The full example is below:
-
-```cpp
-#include
-#include
-
-#include
-#include
-#include
-
-#define CHECK_RET_CODE(call, ret_code) \
- { \
- if ((call) != ret_code) { \
- std::cout << "Failed in call: " << #call << std::endl; \
- std::abort(); \
- } \
- }
-#define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
-#define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
-
-// source code for hiprtc
-static constexpr auto kernel_source{
- R"(
- extern "C"
- __global__ void vector_add(float* output, float* input1, float* input2, size_t size) {
- int i = threadIdx.x;
- if (i < size) {
- output[i] = input1[i] + input2[i];
- }
- }
-)"};
-
-int main() {
- hiprtcProgram prog;
- auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
- kernel_source, // kernel source string
- "vector_add.cpp", // Name of the file
- 0, // Number of headers
- NULL, // Header sources
- NULL); // Name of header file
-
- if (rtc_ret_code != HIPRTC_SUCCESS) {
- std::cout << "Failed to create program" << std::endl;
- std::abort();
- }
-
- hipDeviceProp_t props;
- int device = 0;
- HIP_CHECK(hipGetDeviceProperties(&props, device));
- std::string sarg = std::string("--gpu-architecture=") +
- props.gcnArchName; // device for which binary is to be generated
-
- const char* options[] = {sarg.c_str()};
-
- rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
- 0, // Number of options
- options); // Clang Options
- if (rtc_ret_code != HIPRTC_SUCCESS) {
- std::cout << "Failed to create program" << std::endl;
- std::abort();
- }
-
- size_t logSize;
- HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
-
- if (logSize) {
- std::string log(logSize, '\0');
- HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
- std::cout << "Compilation failed with: " << log << std::endl;
- std::abort();
- }
-
- size_t codeSize;
- HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
-
- std::vector kernel_binary(codeSize);
- HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
-
- HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
-
- hipModule_t module;
- hipFunction_t kernel;
-
- HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
- HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
-
- constexpr size_t ele_size = 256; // total number of items to add
- std::vector hinput, output;
- hinput.reserve(ele_size);
- output.reserve(ele_size);
- for (size_t i = 0; i < ele_size; i++) {
- hinput.push_back(static_cast(i + 1));
- output.push_back(0.0f);
- }
-
- float *dinput1, *dinput2, *doutput;
- HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
- HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
- HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
-
- HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
- HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
-
- struct {
- float* output;
- float* input1;
- float* input2;
- size_t size;
- } args{doutput, dinput1, dinput2, ele_size};
-
- auto size = sizeof(args);
- void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
- HIP_LAUNCH_PARAM_END};
-
- HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
-
- HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
-
- for (size_t i = 0; i < ele_size; i++) {
- if ((hinput[i] + hinput[i]) != output[i]) {
- std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i]
- << std::endl;
- std::abort();
- }
- }
- std::cout << "Passed" << std::endl;
-
- HIP_CHECK(hipFree(dinput1));
- HIP_CHECK(hipFree(dinput2));
- HIP_CHECK(hipFree(doutput));
-}
-```
-
-## Kernel Compilation Cache
-
-HIPRTC incorporates a cache to avoid recompiling kernels between program executions. The contents of the cache include the kernel source code (including the contents of any `#include` headers), the compilation flags, and the compiler version. After a ROCm version update, the kernels are progressively recompiled, and the new results are cached. When the cache is disabled, each kernel is recompiled every time it is requested.
-
-Use the following environment variables to manage the cache status as enabled or disabled, the location for storing the cache contents, and the cache eviction policy:
-
-* `AMD_COMGR_CACHE` By default this variable has a value of `0` and the compilation cache feature is disabled. To enable the feature set the environment variable to a value of `1` (or any value other than `0`). This behavior may change in a future release.
-
-* `AMD_COMGR_CACHE_DIR`: By default the value of this environment variable is defined as `$XDG_CACHE_HOME/comgr_cache`, which defaults to `$USER/.cache/comgr_cache` on Linux, and `%LOCALAPPDATA%\cache\comgr_cache` on Windows. You can specify a different directory for the environment variable to change the path for cache storage. If the runtime fails to access the specified cache directory, or the environment variable is set to an empty string (""), the cache is disabled.
-
-* `AMD_COMGR_CACHE_POLICY`: If assigned a value, the string is interpreted and applied to the cache pruning policy. The string format is consistent with [Clang's ThinLTO cache pruning policy](https://rocm.docs.amd.com/projects/llvm-project/en/latest/LLVM/clang/html/ThinLTO.html#cache-pruning). The default policy is defined as: `prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0`. If the runtime fails to parse the defined string, or the environment variable is set to an empty string (""), the cache is disabled.
-
-:::{note}
- This cache is also shared with the OpenCL runtime shipped with ROCm.
-:::
-
-## HIPRTC specific options
-
-HIPRTC provides a few HIPRTC specific flags
-
-* `--gpu-architecture` : This flag can guide the code object generation for a specific gpu arch. Example: `--gpu-architecture=gfx906:sramecc+:xnack-`, its equivalent to `--offload-arch`.
- * This option is compulsory if compilation is done on a system without AMD GPUs supported by HIP runtime.
- * Otherwise, HIPRTC will load the hip runtime and gather the current device and its architecture info and use it as option.
-* `-fgpu-rdc` : This flag when provided during the `hiprtcCompileProgram` generates the bitcode (HIPRTC doesn't convert this bitcode into ISA and binary). This bitcode can later be fetched using `hiprtcGetBitcode` and `hiprtcGetBitcodeSize` APIs.
-
-### Bitcode
-
-In the usual scenario, the kernel associated with `hiprtcProgram` is compiled into the binary which can be loaded and run. However, if `-fpu-rdc` option is provided in the compile options, HIPRTC calls comgr and generates only the LLVM bitcode. It doesn't convert this bitcode to ISA and generate the final binary.
-
-```cpp
-std::string sarg = std::string("-fgpu-rdc");
-const char* options[] = {
- sarg.c_str() };
-hiprtcCompileProgram(prog, // hiprtcProgram
- 1, // Number of options
- options);
-```
-
-If the compilation is successful, one can load the bitcode in a local variable using the bitcode APIs provided by HIPRTC.
-
-```cpp
-size_t bitCodeSize;
-hiprtcGetBitcodeSize(prog, &bitCodeSize);
-
-vector kernel_bitcode(bitCodeSize);
-hiprtcGetBitcode(prog, kernel_bitcode.data());
-```
-
-### CU Mode vs WGP mode
-
-AMD GPUs consist of an array of workgroup processors, each built with 2 compute units (CUs) capable of executing SIMD32. All the CUs inside a workgroup processor use local data share (LDS).
-
-gfx10+ support execution of wavefront in CU mode and work-group processor mode (WGP). Please refer to section 2.3 of [RDNA3 ISA reference](https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf).
-
-gfx9 and below only supports CU mode.
-
-In WGP mode, 4 warps of a block can simultaneously be executed on the workgroup processor, where as in CU mode only 2 warps of a block can simultaneously execute on a CU. In theory, WGP mode might help with occupancy and increase the performance of certain HIP programs (if not bound to inter warp communication), but might incur performance penalty on other HIP programs which rely on atomics and inter warp communication. This also has effect of how the LDS is split between warps, please refer to [RDNA3 ISA reference](https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna3-shader-instruction-set-architecture-feb-2023_0.pdf) for more information.
-
-HIPRTC assumes **WGP mode by default** for gfx10+. This can be overridden by passing `-mcumode` to HIPRTC compile options in `hiprtcCompileProgram`.
-
-## Linker APIs
-
-The bitcode generated using the HIPRTC Bitcode APIs can be loaded using `hipModule` APIs and also can be linked with other generated bitcodes with appropriate linker flags using the HIPRTC linker APIs. This also provides more flexibility and optimizations to the applications who want to generate the binary dynamically according to their needs. The input bitcodes can be generated only for a specific architecture or it can be a bundled bitcode which is generated for multiple architectures.
-
-### Example
-
-Firstly, HIPRTC link instance or a pending linker invocation must be created using `hiprtcLinkCreate`, with the appropriate linker options provided.
-
-```cpp
-hiprtcLinkCreate( num_options, // number of options
- options, // Array of options
- option_vals, // Array of option values cast to void*
- &rtc_link_state ); // HIPRTC link state created upon success
-```
-
-Following which, the bitcode data can be added to this link instance via `hiprtcLinkAddData` (if the data is present as a string) or `hiprtcLinkAddFile` (if the data is present as a file) with the appropriate input type according to the data or the bitcode used.
-
-```cpp
-hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bit_code_ptr, // input data which is null terminated
- bit_code_size, // size of the input data
- "a", // optional name for this input
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-```
-
-```cpp
-hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
- input_type, // type of the input data or bitcode
- bc_file_path.c_str(), // path to the input file where bitcode is present
- 0, // size of the options
- 0, // Array of options applied to this input
- 0); // Array of option values cast to void*
-```
-
-Once the bitcodes for multiple architectures are added to the link instance, the linking of the device code must be completed using `hiprtcLinkComplete` which generates the final binary.
-
-```cpp
-hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
- &binary, // upon success, points to the output binary
- &binarySize); // size of the binary is stored (optional)
-```
-
-If the `hiprtcLinkComplete` returns successfully, the generated binary can be loaded and run using the `hipModule*` APIs.
-
-```cpp
-hipModuleLoadData(&module, binary);
-```
-
-#### Note
-
-* The compiled binary must be loaded before HIPRTC link instance is destroyed using the `hiprtcLinkDestroy` API.
-
-```cpp
-hiprtcLinkDestroy(rtc_link_state);
-```
-
-* The correct sequence of calls is : `hiprtcLinkCreate`, `hiprtcLinkAddData` or `hiprtcLinkAddFile`, `hiprtcLinkComplete`, `hiprtcModuleLoadData`, `hiprtcLinkDestroy`.
-
-### Input Types
-
-HIPRTC provides `hiprtcJITInputType` enumeration type which defines the input types accepted by the Linker APIs. Here are the `enum` values of `hiprtcJITInputType`. However only the input types `HIPRTC_JIT_INPUT_LLVM_BITCODE`, `HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE` and `HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE` are supported currently.
-
-`HIPRTC_JIT_INPUT_LLVM_BITCODE` can be used to load both LLVM bitcode or LLVM IR assembly code. However, `HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE` and `HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE` are only for bundled bitcode and archive of bundled bitcode.
-
-```cpp
-HIPRTC_JIT_INPUT_CUBIN = 0,
-HIPRTC_JIT_INPUT_PTX,
-HIPRTC_JIT_INPUT_FATBINARY,
-HIPRTC_JIT_INPUT_OBJECT,
-HIPRTC_JIT_INPUT_LIBRARY,
-HIPRTC_JIT_INPUT_NVVM,
-HIPRTC_JIT_NUM_LEGACY_INPUT_TYPES,
-HIPRTC_JIT_INPUT_LLVM_BITCODE = 100,
-HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE = 101,
-HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE = 102,
-HIPRTC_JIT_NUM_INPUT_TYPES = (HIPRTC_JIT_NUM_LEGACY_INPUT_TYPES + 3)
-```
-
-### Backward Compatibility of LLVM Bitcode/IR
-
-For HIP applications utilizing HIPRTC to compile LLVM bitcode/IR, compatibility is assured only when the ROCm or HIP SDK version used for generating the LLVM bitcode/IR matches the version used during the runtime compilation. When an application requires the ingestion of bitcode/IR not derived from the currently installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that are compatible with the version of the bitcode/IR.
-
-comgr, a shared library, incorporates the LLVM/Clang compiler that HIPRTC relies on. To identify the bitcode/IR version that comgr is compatible with, one can execute "clang -v" using the clang binary from the same ROCm or HIP SDK package. For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries released by AMD around mid 2022 would be the best choice, assuming the LLVM/Clang version included in the package is also version 14.
-
-To ensure smooth operation and compatibility, an application may choose to ship the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to clearly specify the version requirements and dependencies. This approach guarantees that the application can correctly compile the specified version of bitcode/IR.
-
-### Link Options
-
-* `HIPRTC_JIT_IR_TO_ISA_OPT_EXT` - AMD Only. Options to be passed on to link step of compiler by `hiprtcLinkCreate`.
-* `HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT` - AMD Only. Count of options passed on to link step of compiler.
-
-Example:
-
-```cpp
-const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
-std::vector jit_options = {HIPRTC_JIT_IR_TO_ISA_OPT_EXT,
- HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT};
-size_t isaoptssize = 4;
-const void* lopts[] = {(void*)isaopts, (void*)(isaoptssize)};
-hiprtcLinkState linkstate;
-hiprtcLinkCreate(2, jit_options.data(), (void**)lopts, &linkstate);
-```
-
-## Error Handling
-
-HIPRTC defines the `hiprtcResult` enumeration type and a function `hiprtcGetErrorString` for API call error handling. `hiprtcResult` `enum` defines the API result codes. HIPRTC APIs return `hiprtcResult` to indicate the call result. `hiprtcGetErrorString` function returns a string describing the given `hiprtcResult` code, e.g., HIPRTC_SUCCESS to "HIPRTC_SUCCESS". For unrecognized enumeration values, it returns "Invalid HIPRTC error code".
-
-`hiprtcResult` `enum` supported values and the `hiprtcGetErrorString` usage are mentioned below.
-
-```cpp
-HIPRTC_SUCCESS = 0,
-HIPRTC_ERROR_OUT_OF_MEMORY = 1,
-HIPRTC_ERROR_PROGRAM_CREATION_FAILURE = 2,
-HIPRTC_ERROR_INVALID_INPUT = 3,
-HIPRTC_ERROR_INVALID_PROGRAM = 4,
-HIPRTC_ERROR_INVALID_OPTION = 5,
-HIPRTC_ERROR_COMPILATION = 6,
-HIPRTC_ERROR_LINKING = 7,
-HIPRTC_ERROR_BUILTIN_OPERATION_FAILURE = 8,
-HIPRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 9,
-HIPRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 10,
-HIPRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 11,
-HIPRTC_ERROR_INTERNAL_ERROR = 12
-```
-
-```cpp
-hiprtcResult result;
-result = hiprtcCompileProgram(prog, 1, opts);
-if (result != HIPRTC_SUCCESS) {
-std::cout << "hiprtcCompileProgram fails with error " << hiprtcGetErrorString(result);
-}
-```
-
-## HIPRTC General APIs
-
-HIPRTC provides the following API for querying the version.
-
-`hiprtcVersion(int* major, int* minor)` - This sets the output parameters major and minor with the HIP Runtime compilation major version and minor version number respectively.
-
-Currently, it returns hardcoded value. This should be implemented to return HIP runtime major and minor version in the future releases.
-
-## Lowered Names (Mangled Names)
-
-HIPRTC mangles the `__global__` function names and names of `__device__` and `__constant__` variables. If the generated binary is being loaded using the HIP Runtime API, the kernel function or `__device__/__constant__` variable must be looked up by name, but this is very hard when the name has been mangled. To overcome this, HIPRTC provides API functions that map `__global__` function or `__device__/__constant__` variable names in the source to the mangled names present in the generated binary.
-
-The two APIs `hiprtcAddNameExpression` and `hiprtcGetLoweredName` provide this functionality. First, a 'name expression' string denoting the address for the `__global__` function or `__device__/__constant__` variable is provided to `hiprtcAddNameExpression`. Then, the program is compiled with `hiprtcCompileProgram`. During compilation, HIPRTC will parse the name expression string as a C++ constant expression at the end of the user program. Finally, the function `hiprtcGetLoweredName` is called with the original name expression and it returns a pointer to the lowered name. The lowered name can be used to refer to the kernel or variable in the HIP Runtime API.
-
-### Note
-
-* The identical name expression string must be provided on a subsequent call to `hiprtcGetLoweredName` to extract the lowered name.
-* The correct sequence of calls is : `hiprtcAddNameExpression`, `hiprtcCompileProgram`, `hiprtcGetLoweredName`, `hiprtcDestroyProgram`.
-* The lowered names must be fetched using `hiprtcGetLoweredName` only after the HIPRTC program has been compiled, and before it has been destroyed.
-
-### Example
-
-kernel containing various definitions `__global__` functions/function templates and `__device__/__constant__` variables can be stored in a string.
-
-```cpp
-static constexpr const char gpu_program[] {
-R"(
-__device__ int V1; // set from host code
-static __global__ void f1(int *result) { *result = V1 + 10; }
-namespace N1 {
-namespace N2 {
-__constant__ int V2; // set from host code
-__global__ void f2(int *result) { *result = V2 + 20; }
-}
-}
-template
-__global__ void f3(int *result) { *result = sizeof(T); }
-)"};
-```
-
-`hiprtcAddNameExpression` is called with various name expressions referring to the address of `__global__` functions and `__device__/__constant__` variables.
-
-```cpp
-kernel_name_vec.push_back("&f1");
-kernel_name_vec.push_back("N1::N2::f2");
-kernel_name_vec.push_back("f3");
-for (auto&& x : kernel_name_vec) hiprtcAddNameExpression(prog, x.c_str());
-variable_name_vec.push_back("&V1");
-variable_name_vec.push_back("&N1::N2::V2");
-for (auto&& x : variable_name_vec) hiprtcAddNameExpression(prog, x.c_str());
-```
-
-After which, the program is compiled using `hiprtcCompileProgram` and the generated binary is loaded using `hipModuleLoadData`. And the mangled names can be fetched using `hirtcGetLoweredName`.
-
-```cpp
-for (decltype(variable_name_vec.size()) i = 0; i != variable_name_vec.size(); ++i) {
- const char* name;
- hiprtcGetLoweredName(prog, variable_name_vec[i].c_str(), &name);
-}
-```
-
-```cpp
-for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) {
- const char* name;
- hiprtcGetLoweredName(prog, kernel_name_vec[i].c_str(), &name);
-}
-```
-
-The mangled name of the variables are used to look up the variable in the module and update its value.
-
-```cpp
-hipDeviceptr_t variable_addr;
-size_t bytes{};
-hipModuleGetGlobal(&variable_addr, &bytes, module, name);
-hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value));
-```
-
-Finally, the mangled name of the kernel is used to launch it using the `hipModule` APIs.
-
-```cpp
-hipFunction_t kernel;
-hipModuleGetFunction(&kernel, module, name);
-hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config);
-```
-
-Please have a look at `hiprtcGetLoweredName.cpp` for the detailed example.
-
-## Versioning
-
-HIPRTC follows the below versioning.
-
-* Linux
- * HIPRTC follows the same versioning as HIP runtime library.
- * The `so` name field for the shared library is set to MAJOR version. For example, for HIP 5.3 the `so` name is set to 5 (`hiprtc.so.5`).
-* Windows
- * HIPRTC dll is named as `hiprtcXXYY.dll` where XX is MAJOR version and YY is MINOR version. For example, for HIP 5.3 the name is `hiprtc0503.dll`.
-
-## HIP header support
-
-* Added HIPRTC support for all the hip common header files such as library_types.h, hip_math_constants.h, hip_complex.h, math_functions.h, surface_types.h etc. from 6.1. HIPRTC users need not include any HIP macros or constants explicitly in their header files. All of these should get included via HIPRTC builtins when the app links to HIPRTC library.
-
-## Deprecation notice
-
-* Currently HIPRTC APIs are separated from HIP APIs and HIPRTC is available as a separate library `libhiprtc.so`/`libhiprtc.dll`. But on Linux, HIPRTC symbols are also present in `libamdhip64.so` in order to support the existing applications. Gradually, these symbols will be removed from HIP library and applications using HIPRTC will be required to explicitly link to HIPRTC library. However, on Windows `hiprtc.dll` must be used as the `amdhip64.dll` doesn't contain the HIPRTC symbols.
-* Data types such as `uint32_t`, `uint64_t`, `int32_t`, `int64_t` defined in std namespace in HIPRTC are deprecated earlier and are being removed from ROCm release 6.1 since these can conflict with the standard C++ data types. These data types are now prefixed with `__hip__`, e.g. `__hip_uint32_t`. Applications previously using `std::uint32_t` or similar types can use `__hip_` prefixed types to avoid conflicts with standard std namespace or application can have their own definitions for these types. Also, type_traits templates previously defined in std namespace are moved to `__hip_internal` namespace as implementation details.
diff --git a/docs/how-to/hip_rtc.rst b/docs/how-to/hip_rtc.rst
new file mode 100644
index 0000000000..223e11081c
--- /dev/null
+++ b/docs/how-to/hip_rtc.rst
@@ -0,0 +1,726 @@
+.. meta::
+ :description: HIP runtime compiler (RTC)
+ :keywords: AMD, ROCm, HIP, CUDA, RTC, HIP runtime compiler
+
+.. _hip_runtime_compiler_how-to:
+
+*******************************************************************************
+Programming for HIP runtime compiler (RTC)
+*******************************************************************************
+
+HIP supports the kernels compilation at runtime with the ``hiprtc*`` APIs.
+Kernels can be stored as a text string and can be passed to HIPRTC APIs
+alongside options to guide the compilation.
+
+.. note::
+
+ * This library can be used for compilation on systems without AMD GPU drivers
+ installed (offline compilation). However, running the compiled code still
+ requires both the HIP runtime library and GPU drivers on the target system.
+ * This library depends on Code Object Manager (comgr). You can try to
+ statically link comgr into HIPRTC to avoid ambiguity.
+ * Developers can bundle this library with their application.
+
+Compilation APIs
+===============================================================================
+
+To use HIPRTC functionality the header needs to be included:
+
+.. code-block:: cpp
+
+ #include
+
+Kernels can be stored in a string:
+
+.. code-block:: cpp
+
+ static constexpr auto kernel_source {
+ R"(
+ extern "C"
+ __global__ void vector_add(float* output, float* input1, float* input2, size_t size) {
+ int i = threadIdx.x;
+ if (i < size) {
+ output[i] = input1[i] + input2[i];
+ }
+ }
+ )"};
+
+To compile this kernel, it needs to be associated with
+:cpp:struct:`hiprtcProgram` type, which is done by declaring :code:`hiprtcProgram prog;`
+and associating the string of kernel with this program:
+
+.. code-block:: cpp
+
+ hiprtcCreateProgram(&prog, // HIPRTC program handle
+ kernel_source, // HIP kernel source string
+ "vector_add.cpp", // Name of the HIP program, can be null or an empty string
+ 0, // Number of headers
+ NULL, // Header sources
+ NULL); // Name of header files
+
+:cpp:func:`hiprtcCreateProgram` API also allows you to add headers which can be
+included in your RTC program. For online compilation, the compiler pre-defines
+HIP device API functions, HIP specific types and macros for device compilation,
+but doesn't include standard C/C++ headers by default. Users can only include
+header files provided to :cpp:func:`hiprtcCreateProgram`.
+
+After associating the kernel string with :cpp:struct:`hiprtcProgram`, you can
+now compile this program using:
+
+.. code-block:: cpp
+
+ hiprtcCompileProgram(prog, // hiprtcProgram
+ 0, // Number of options
+ options); // Clang Options [Supported Clang Options](clang_options.md)
+
+:cpp:func:`hiprtcCompileProgram` returns a status value which can be converted
+to string via :cpp:func:`hiprtcGetErrorString`. If compilation is successful,
+:cpp:func:`hiprtcCompileProgram` will return ``HIPRTC_SUCCESS``.
+
+if the compilation fails or produces warnings, you can look up the logs via:
+
+.. code-block:: cpp
+
+ size_t logSize;
+ hiprtcGetProgramLogSize(prog, &logSize);
+
+ if (logSize) {
+ string log(logSize, '\0');
+ hiprtcGetProgramLog(prog, &log[0]);
+ // Corrective action with logs
+ }
+
+If the compilation is successful, you can load the compiled binary in a local
+variable.
+
+.. code-block:: cpp
+
+ size_t codeSize;
+ hiprtcGetCodeSize(prog, &codeSize);
+
+ vector kernel_binary(codeSize);
+ hiprtcGetCode(prog, kernel_binary.data());
+
+After loading the binary, :cpp:struct:`hiprtcProgram` can be destroyed.
+:code:`hiprtcDestroyProgram(&prog);`
+
+The binary present in ``kernel_binary`` can now be loaded via
+:cpp:func:`hipModuleLoadData` API.
+
+.. code-block:: cpp
+
+ hipModule_t module;
+ hipFunction_t kernel;
+
+ hipModuleLoadData(&module, kernel_binary.data());
+ hipModuleGetFunction(&kernel, module, "vector_add");
+
+And now this kernel can be launched via ``hipModule`` APIs.
+
+The full example is below:
+
+.. code-block:: cpp
+
+ #include
+ #include
+
+ #include
+ #include
+ #include
+
+ #define CHECK_RET_CODE(call, ret_code) \
+ { \
+ if ((call) != ret_code) { \
+ std::cout << "Failed in call: " << #call << std::endl; \
+ std::abort(); \
+ } \
+ }
+ #define HIP_CHECK(call) CHECK_RET_CODE(call, hipSuccess)
+ #define HIPRTC_CHECK(call) CHECK_RET_CODE(call, HIPRTC_SUCCESS)
+
+ // source code for hiprtc
+ static constexpr auto kernel_source{
+ R"(
+ extern "C"
+ __global__ void vector_add(float* output, float* input1, float* input2, size_t size) {
+ int i = threadIdx.x;
+ if (i < size) {
+ output[i] = input1[i] + input2[i];
+ }
+ }
+ )"};
+
+ int main() {
+ hiprtcProgram prog;
+ auto rtc_ret_code = hiprtcCreateProgram(&prog, // HIPRTC program handle
+ kernel_source, // kernel source string
+ "vector_add.cpp", // Name of the file
+ 0, // Number of headers
+ NULL, // Header sources
+ NULL); // Name of header file
+
+ if (rtc_ret_code != HIPRTC_SUCCESS) {
+ std::cout << "Failed to create program" << std::endl;
+ std::abort();
+ }
+
+ hipDeviceProp_t props;
+ int device = 0;
+ HIP_CHECK(hipGetDeviceProperties(&props, device));
+ std::string sarg = std::string("--gpu-architecture=") +
+ props.gcnArchName; // device for which binary is to be generated
+
+ const char* options[] = {sarg.c_str()};
+
+ rtc_ret_code = hiprtcCompileProgram(prog, // hiprtcProgram
+ 0, // Number of options
+ options); // Clang Options
+ if (rtc_ret_code != HIPRTC_SUCCESS) {
+ std::cout << "Failed to create program" << std::endl;
+ std::abort();
+ }
+
+ size_t logSize;
+ HIPRTC_CHECK(hiprtcGetProgramLogSize(prog, &logSize));
+
+ if (logSize) {
+ std::string log(logSize, '\0');
+ HIPRTC_CHECK(hiprtcGetProgramLog(prog, &log[0]));
+ std::cout << "Compilation failed or produced warnings: " << log << std::endl;
+ std::abort();
+ }
+
+ size_t codeSize;
+ HIPRTC_CHECK(hiprtcGetCodeSize(prog, &codeSize));
+
+ std::vector kernel_binary(codeSize);
+ HIPRTC_CHECK(hiprtcGetCode(prog, kernel_binary.data()));
+
+ HIPRTC_CHECK(hiprtcDestroyProgram(&prog));
+
+ hipModule_t module;
+ hipFunction_t kernel;
+
+ HIP_CHECK(hipModuleLoadData(&module, kernel_binary.data()));
+ HIP_CHECK(hipModuleGetFunction(&kernel, module, "vector_add"));
+
+ constexpr size_t ele_size = 256; // total number of items to add
+ std::vector hinput, output;
+ hinput.reserve(ele_size);
+ output.reserve(ele_size);
+ for (size_t i = 0; i < ele_size; i++) {
+ hinput.push_back(static_cast(i + 1));
+ output.push_back(0.0f);
+ }
+
+ float *dinput1, *dinput2, *doutput;
+ HIP_CHECK(hipMalloc(&dinput1, sizeof(float) * ele_size));
+ HIP_CHECK(hipMalloc(&dinput2, sizeof(float) * ele_size));
+ HIP_CHECK(hipMalloc(&doutput, sizeof(float) * ele_size));
+
+ HIP_CHECK(hipMemcpy(dinput1, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+ HIP_CHECK(hipMemcpy(dinput2, hinput.data(), sizeof(float) * ele_size, hipMemcpyHostToDevice));
+
+ struct {
+ float* output;
+ float* input1;
+ float* input2;
+ size_t size;
+ } args{doutput, dinput1, dinput2, ele_size};
+
+ auto size = sizeof(args);
+ void* config[] = {HIP_LAUNCH_PARAM_BUFFER_POINTER, &args, HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
+ HIP_LAUNCH_PARAM_END};
+
+ HIP_CHECK(hipModuleLaunchKernel(kernel, 1, 1, 1, ele_size, 1, 1, 0, nullptr, nullptr, config));
+
+ HIP_CHECK(hipMemcpy(output.data(), doutput, sizeof(float) * ele_size, hipMemcpyDeviceToHost));
+
+ for (size_t i = 0; i < ele_size; i++) {
+ if ((hinput[i] + hinput[i]) != output[i]) {
+ std::cout << "Failed in validation: " << (hinput[i] + hinput[i]) << " - " << output[i]
+ << std::endl;
+ std::abort();
+ }
+ }
+ std::cout << "Passed" << std::endl;
+
+ HIP_CHECK(hipFree(dinput1));
+ HIP_CHECK(hipFree(dinput2));
+ HIP_CHECK(hipFree(doutput));
+ }
+
+
+Kernel Compilation Cache
+===============================================================================
+
+HIPRTC incorporates a cache to avoid recompiling kernels between program
+executions. The contents of the cache include the kernel source code (including
+the contents of any ``#include`` headers), the compilation flags, and the
+compiler version. After a ROCm version update, the kernels are progressively
+recompiled, and the new results are cached. When the cache is disabled, each
+kernel is recompiled every time it is requested.
+
+Use the following environment variables to manage the cache status as enabled or
+disabled, the location for storing the cache contents, and the cache eviction
+policy:
+
+* ``AMD_COMGR_CACHE`` By default this variable is unset and the
+ compilation cache feature is enabled. To disable the feature set the
+ environment variable to a value of ``0``.
+
+* ``AMD_COMGR_CACHE_DIR``: By default the value of this environment variable is
+ defined as ``$XDG_CACHE_HOME/comgr``, which defaults to
+ ``$USER/.cache/comgr`` on Linux, and ``%LOCALAPPDATA%\cache\comgr``
+ on Windows. You can specify a different directory for the environment variable
+ to change the path for cache storage. If the runtime fails to access the
+ specified cache directory the cache is disabled. If the environment variable
+ is set to an empty string (``""``), the default directory is used.
+
+* ``AMD_COMGR_CACHE_POLICY``: If assigned a value, the string is interpreted and
+ applied to the cache pruning policy. The string format is consistent with
+ `Clang's ThinLTO cache pruning policy `_.
+ The default policy is defined as:
+ ``prune_interval=1h:prune_expiration=0h:cache_size=75%:cache_size_bytes=30g:cache_size_files=0``.
+ If the runtime fails to parse the defined string, or the environment variable
+ is set to an empty string (""), the cache is disabled.
+
+.. note::
+
+ This cache is also shared with the OpenCL runtime shipped with ROCm.
+
+HIPRTC specific options
+===============================================================================
+
+HIPRTC provides a few HIPRTC specific flags:
+
+* ``--gpu-architecture`` : This flag can guide the code object generation for a
+ specific GPU architecture. Example:
+ ``--gpu-architecture=gfx906:sramecc+:xnack-``, its equivalent to
+ ``--offload-arch``.
+
+ * This option is compulsory if compilation is done on a system without AMD
+ GPUs supported by HIP runtime.
+
+ * Otherwise, HIPRTC will load the hip runtime and gather the current device
+ and its architecture info and use it as option.
+
+* ``-fgpu-rdc`` : This flag when provided during the
+ :cpp:func:`hiprtcCreateProgram` generates the bitcode (HIPRTC doesn't convert
+ this bitcode into ISA and binary). This bitcode can later be fetched using
+ :cpp:func:`hiprtcGetBitcode` and :cpp:func:`hiprtcGetBitcodeSize` APIs.
+
+Bitcode
+-------------------------------------------------------------------------------
+
+In the usual scenario, the kernel associated with :cpp:struct:`hiprtcProgram` is
+compiled into the binary which can be loaded and run. However, if ``-fgpu-rdc``
+option is provided in the compile options, HIPRTC calls comgr and generates only
+the LLVM bitcode. It doesn't convert this bitcode to ISA and generate the final
+binary.
+
+.. code-block:: cpp
+
+ std::string sarg = std::string("-fgpu-rdc");
+ const char* options[] = {
+ sarg.c_str() };
+ hiprtcCompileProgram(prog, // hiprtcProgram
+ 1, // Number of options
+ options);
+
+If the compilation is successful, one can load the bitcode in a local variable
+using the bitcode APIs provided by HIPRTC.
+
+.. code-block:: cpp
+
+ size_t bitCodeSize;
+ hiprtcGetBitcodeSize(prog, &bitCodeSize);
+
+ vector kernel_bitcode(bitCodeSize);
+ hiprtcGetBitcode(prog, kernel_bitcode.data());
+
+CU Mode vs WGP mode
+-------------------------------------------------------------------------------
+
+AMD GPUs consist of an array of workgroup processors, each built with 2 compute
+units (CUs) capable of executing SIMD32. All the CUs inside a workgroup
+processor use local data share (LDS).
+
+gfx10+ support execution of wavefront in CU mode and work-group processor mode
+(WGP). Please refer to section 2.3 of `RDNA3 ISA reference `_.
+
+gfx9 and below only supports CU mode.
+
+In WGP mode, 4 warps of a block can simultaneously be executed on the workgroup
+processor, where as in CU mode only 2 warps of a block can simultaneously
+execute on a CU. In theory, WGP mode might help with occupancy and increase the
+performance of certain HIP programs (if not bound to inter warp communication),
+but might incur performance penalty on other HIP programs which rely on atomics
+and inter warp communication. This also has effect of how the LDS is split
+between warps, please refer to `RDNA3 ISA reference `_ for more information.
+
+.. note::
+
+ HIPRTC assumes **WGP mode by default** for gfx10+. This can be overridden by
+ passing ``-mcumode`` to HIPRTC compile options in
+ :cpp:func:`hiprtcCompileProgram`.
+
+Linker APIs
+===============================================================================
+
+The bitcode generated using the HIPRTC Bitcode APIs can be loaded using
+``hipModule`` APIs and also can be linked with other generated bitcodes with
+appropriate linker flags using the HIPRTC linker APIs. This also provides more
+flexibility and optimizations to the applications who want to generate the
+binary dynamically according to their needs. The input bitcodes can be generated
+only for a specific architecture or it can be a bundled bitcode which is
+generated for multiple architectures.
+
+Example
+-------------------------------------------------------------------------------
+
+Firstly, HIPRTC link instance or a pending linker invocation must be created
+using :cpp:func:`hiprtcLinkCreate`, with the appropriate linker options
+provided.
+
+.. code-block:: cpp
+
+ hiprtcLinkCreate( num_options, // number of options
+ options, // Array of options
+ option_vals, // Array of option values cast to void*
+ &rtc_link_state ); // HIPRTC link state created upon success
+
+Following which, the bitcode data can be added to this link instance via
+:cpp:func:`hiprtcLinkAddData` (if the data is present as a string) or
+:cpp:func:`hiprtcLinkAddFile` (if the data is present as a file) with the
+appropriate input type according to the data or the bitcode used.
+
+.. code-block:: cpp
+
+ hiprtcLinkAddData(rtc_link_state, // HIPRTC link state
+ input_type, // type of the input data or bitcode
+ bit_code_ptr, // input data which is null terminated
+ bit_code_size, // size of the input data
+ "a", // optional name for this input
+ 0, // size of the options
+ 0, // Array of options applied to this input
+ 0); // Array of option values cast to void*
+
+.. code-block:: cpp
+
+ hiprtcLinkAddFile(rtc_link_state, // HIPRTC link state
+ input_type, // type of the input data or bitcode
+ bc_file_path.c_str(), // path to the input file where bitcode is present
+ 0, // size of the options
+ 0, // Array of options applied to this input
+ 0); // Array of option values cast to void*
+
+Once the bitcodes for multiple architectures are added to the link instance, the
+linking of the device code must be completed using :cpp:func:`hiprtcLinkComplete`
+which generates the final binary.
+
+.. code-block:: cpp
+
+ hiprtcLinkComplete(rtc_link_state, // HIPRTC link state
+ &binary, // upon success, points to the output binary
+ &binarySize); // size of the binary is stored (optional)
+
+If the :cpp:func:`hiprtcLinkComplete` returns successfully, the generated binary
+can be loaded and run using the ``hipModule*`` APIs.
+
+.. code-block:: cpp
+
+ hipModuleLoadData(&module, binary);
+
+.. note::
+
+ * The compiled binary must be loaded before HIPRTC link instance is destroyed
+ using the :cpp:func:`hiprtcLinkDestroy` API.
+
+ .. code-block:: cpp
+
+ hiprtcLinkDestroy(rtc_link_state);
+
+ * The correct sequence of calls is : :cpp:func:`hiprtcLinkCreate`,
+ :cpp:func:`hiprtcLinkAddData` or :cpp:func:`hiprtcLinkAddFile`,
+ :cpp:func:`hiprtcLinkComplete`, :cpp:func:`hipModuleLoadData`,
+ :cpp:func:`hiprtcLinkDestroy`.
+
+Input Types
+-------------------------------------------------------------------------------
+
+HIPRTC provides ``hiprtcJITInputType`` enumeration type which defines the input
+types accepted by the Linker APIs. Here are the ``enum`` values of
+``hiprtcJITInputType``. However only the input types
+``HIPRTC_JIT_INPUT_LLVM_BITCODE``, ``HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE`` and
+``HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE`` are supported currently.
+
+``HIPRTC_JIT_INPUT_LLVM_BITCODE`` can be used to load both LLVM bitcode or LLVM
+IR assembly code. However, ``HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE`` and
+``HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE`` are only for bundled
+bitcode and archive of bundled bitcode.
+
+.. code-block:: cpp
+
+ HIPRTC_JIT_INPUT_CUBIN = 0,
+ HIPRTC_JIT_INPUT_PTX,
+ HIPRTC_JIT_INPUT_FATBINARY,
+ HIPRTC_JIT_INPUT_OBJECT,
+ HIPRTC_JIT_INPUT_LIBRARY,
+ HIPRTC_JIT_INPUT_NVVM,
+ HIPRTC_JIT_NUM_LEGACY_INPUT_TYPES,
+ HIPRTC_JIT_INPUT_LLVM_BITCODE = 100,
+ HIPRTC_JIT_INPUT_LLVM_BUNDLED_BITCODE = 101,
+ HIPRTC_JIT_INPUT_LLVM_ARCHIVES_OF_BUNDLED_BITCODE = 102,
+ HIPRTC_JIT_NUM_INPUT_TYPES = (HIPRTC_JIT_NUM_LEGACY_INPUT_TYPES + 3)
+
+Backward Compatibility of LLVM Bitcode/IR
+-------------------------------------------------------------------------------
+
+For HIP applications utilizing HIPRTC to compile LLVM bitcode/IR, compatibility
+is assured only when the ROCm or HIP SDK version used for generating the LLVM
+bitcode/IR matches the version used during the runtime compilation. When an
+application requires the ingestion of bitcode/IR not derived from the currently
+installed AMD compiler, it must run with HIPRTC and comgr dynamic libraries that
+are compatible with the version of the bitcode/IR.
+
+`Comgr `_ is a
+shared library that incorporates the LLVM/Clang compiler that HIPRTC relies on.
+To identify the bitcode/IR version that comgr is compatible with, one can
+execute "clang -v" using the clang binary from the same ROCm or HIP SDK package.
+For instance, if compiling bitcode/IR version 14, the HIPRTC and comgr libraries
+released by AMD around mid 2022 would be the best choice, assuming the
+LLVM/Clang version included in the package is also version 14.
+
+To ensure smooth operation and compatibility, an application may choose to ship
+the specific versions of HIPRTC and comgr dynamic libraries, or it may opt to
+clearly specify the version requirements and dependencies. This approach
+guarantees that the application can correctly compile the specified version of
+bitcode/IR.
+
+Link Options
+-------------------------------------------------------------------------------
+
+* ``HIPRTC_JIT_IR_TO_ISA_OPT_EXT`` - AMD Only. Options to be passed on to link
+ step of compiler by :cpp:func:`hiprtcLinkCreate`.
+
+* ``HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT`` - AMD Only. Count of options passed on
+ to link step of compiler.
+
+Example:
+
+.. code-block:: cpp
+
+ const char* isaopts[] = {"-mllvm", "-inline-threshold=1", "-mllvm", "-inlinehint-threshold=1"};
+ std::vector jit_options = {HIPRTC_JIT_IR_TO_ISA_OPT_EXT,
+ HIPRTC_JIT_IR_TO_ISA_OPT_COUNT_EXT};
+ size_t isaoptssize = 4;
+ const void* lopts[] = {(void*)isaopts, (void*)(isaoptssize)};
+ hiprtcLinkState linkstate;
+ hiprtcLinkCreate(2, jit_options.data(), (void**)lopts, &linkstate);
+
+Error Handling
+===============================================================================
+
+HIPRTC defines the ``hiprtcResult`` enumeration type and a function
+:cpp:func:`hiprtcGetErrorString` for API call error handling. ``hiprtcResult``
+``enum`` defines the API result codes. HIPRTC APIs return ``hiprtcResult`` to
+indicate the call result. :cpp:func:`hiprtcGetErrorString` function returns a
+string describing the given ``hiprtcResult`` code, for example HIPRTC_SUCCESS to
+"HIPRTC_SUCCESS". For unrecognized enumeration values, it returns
+"Invalid HIPRTC error code".
+
+``hiprtcResult`` ``enum`` supported values and the
+:cpp:func:`hiprtcGetErrorString` usage are mentioned below.
+
+.. code-block:: cpp
+
+ HIPRTC_SUCCESS = 0,
+ HIPRTC_ERROR_OUT_OF_MEMORY = 1,
+ HIPRTC_ERROR_PROGRAM_CREATION_FAILURE = 2,
+ HIPRTC_ERROR_INVALID_INPUT = 3,
+ HIPRTC_ERROR_INVALID_PROGRAM = 4,
+ HIPRTC_ERROR_INVALID_OPTION = 5,
+ HIPRTC_ERROR_COMPILATION = 6,
+ HIPRTC_ERROR_LINKING = 7,
+ HIPRTC_ERROR_BUILTIN_OPERATION_FAILURE = 8,
+ HIPRTC_ERROR_NO_NAME_EXPRESSIONS_AFTER_COMPILATION = 9,
+ HIPRTC_ERROR_NO_LOWERED_NAMES_BEFORE_COMPILATION = 10,
+ HIPRTC_ERROR_NAME_EXPRESSION_NOT_VALID = 11,
+ HIPRTC_ERROR_INTERNAL_ERROR = 12
+
+.. code-block:: cpp
+
+ hiprtcResult result;
+ result = hiprtcCompileProgram(prog, 1, opts);
+ if (result != HIPRTC_SUCCESS) {
+ std::cout << "hiprtcCompileProgram fails with error " << hiprtcGetErrorString(result);
+ }
+
+HIPRTC General APIs
+===============================================================================
+
+HIPRTC provides ``hiprtcVersion(int* major, int* minor)`` for querying the
+version. This sets the output parameters major and minor with the HIP Runtime
+compilation major version and minor version number respectively.
+
+Currently, it returns hardcoded values. This should be implemented to return HIP
+runtime major and minor version in the future releases.
+
+Lowered Names (Mangled Names)
+===============================================================================
+
+HIPRTC mangles the ``__global__`` function names and names of ``__device__`` and
+``__constant__`` variables. If the generated binary is being loaded using the
+HIP Runtime API, the kernel function or ``__device__/__constant__`` variable
+must be looked up by name, but this is very hard when the name has been mangled.
+To overcome this, HIPRTC provides API functions that map ``__global__`` function
+or ``__device__/__constant__`` variable names in the source to the mangled names
+present in the generated binary.
+
+The two APIs :cpp:func:`hiprtcAddNameExpression` and
+:cpp:func:`hiprtcGetLoweredName` provide this functionality. First, a 'name
+expression' string denoting the address for the ``__global__`` function or
+``__device__/__constant__`` variable is provided to
+:cpp:func:`hiprtcAddNameExpression`. Then, the program is compiled with
+:cpp:func:`hiprtcCreateProgram`. During compilation, HIPRTC will parse the name
+expression string as a C++ constant expression at the end of the user program.
+Finally, the function :cpp:func:`hiprtcGetLoweredName` is called with the
+original name expression and it returns a pointer to the lowered name. The
+lowered name can be used to refer to the kernel or variable in the HIP Runtime
+API.
+
+.. note::
+
+ * The identical name expression string must be provided on a subsequent call
+ to :cpp:func:`hiprtcGetLoweredName` to extract the lowered name.
+
+ * The correct sequence of calls is : :cpp:func:`hiprtcAddNameExpression`,
+ :cpp:func:`hiprtcCreateProgram`, :cpp:func:`hiprtcGetLoweredName`,
+ :cpp:func:`hiprtcDestroyProgram`.
+
+ * The lowered names must be fetched using :cpp:func:`hiprtcGetLoweredName`
+ only after the HIPRTC program has been compiled, and before it has been
+ destroyed.
+
+Example
+-------------------------------------------------------------------------------
+
+Kernel containing various definitions ``__global__`` functions/function
+templates and ``__device__/__constant__`` variables can be stored in a string.
+
+.. code-block:: cpp
+
+ static constexpr const char gpu_program[] {
+ R"(
+ __device__ int V1; // set from host code
+ static __global__ void f1(int *result) { *result = V1 + 10; }
+ namespace N1 {
+ namespace N2 {
+ __constant__ int V2; // set from host code
+ __global__ void f2(int *result) { *result = V2 + 20; }
+ }
+ }
+ template
+ __global__ void f3(int *result) { *result = sizeof(T); }
+ )"};
+
+:cpp:func:`hiprtcAddNameExpression` is called with various name expressions
+referring to the address of ``__global__`` functions and
+``__device__/__constant__`` variables.
+
+.. code-block:: cpp
+
+ kernel_name_vec.push_back("&f1");
+ kernel_name_vec.push_back("N1::N2::f2");
+ kernel_name_vec.push_back("f3");
+ for (auto&& x : kernel_name_vec) hiprtcAddNameExpression(prog, x.c_str());
+ variable_name_vec.push_back("&V1");
+ variable_name_vec.push_back("&N1::N2::V2");
+ for (auto&& x : variable_name_vec) hiprtcAddNameExpression(prog, x.c_str());
+
+After which, the program is compiled using :cpp:func:`hiprtcCompileProgram`, the
+generated binary is loaded using :cpp:func:`hipModuleLoadData`, and the mangled
+names can be fetched using :cpp:func:`hirtcGetLoweredName`.
+
+.. code-block:: cpp
+
+ for (decltype(variable_name_vec.size()) i = 0; i != variable_name_vec.size(); ++i) {
+ const char* name;
+ hiprtcGetLoweredName(prog, variable_name_vec[i].c_str(), &name);
+ }
+
+.. code-block:: cpp
+
+ for (decltype(kernel_name_vec.size()) i = 0; i != kernel_name_vec.size(); ++i) {
+ const char* name;
+ hiprtcGetLoweredName(prog, kernel_name_vec[i].c_str(), &name);
+ }
+
+The mangled name of the variables are used to look up the variable in the module
+and update its value.
+
+.. code-block:: cpp
+
+ hipDeviceptr_t variable_addr;
+ size_t bytes{};
+ hipModuleGetGlobal(&variable_addr, &bytes, module, name);
+ hipMemcpyHtoD(variable_addr, &initial_value, sizeof(initial_value));
+
+
+Finally, the mangled name of the kernel is used to launch it using the
+``hipModule`` APIs.
+
+.. code-block:: cpp
+
+ hipFunction_t kernel;
+ hipModuleGetFunction(&kernel, module, name);
+ hipModuleLaunchKernel(kernel, 1, 1, 1, 1, 1, 1, 0, nullptr, nullptr, config);
+
+Versioning
+===============================================================================
+
+HIPRTC uses the following versioning:
+
+* Linux
+
+ * HIPRTC follows the same versioning as HIP runtime library.
+ * The ``so`` name field for the shared library is set to MAJOR version. For
+ example, for HIP 5.3 the ``so`` name is set to 5 (``hiprtc.so.5``).
+
+* Windows
+
+ * HIPRTC dll is named as ``hiprtcXXYY.dll`` where ``XX`` is MAJOR version and
+ ``YY`` is MINOR version. For example, for HIP 5.3 the name is
+ ``hiprtc0503.dll``.
+
+HIP header support
+===============================================================================
+
+Added HIPRTC support for all the hip common header files such as
+``library_types.h``, ``hip_math_constants.h``, ``hip_complex.h``,
+``math_functions.h``, ``surface_types.h`` etc. from 6.1. HIPRTC users need not
+include any HIP macros or constants explicitly in their header files. All of
+these should get included via HIPRTC builtins when the app links to HIPRTC
+library.
+
+Deprecation notice
+===============================================================================
+
+* Currently HIPRTC APIs are separated from HIP APIs and HIPRTC is available as a
+ separate library ``libhiprtc.so``/ ``libhiprtc.dll``. But on Linux, HIPRTC
+ symbols are also present in ``libamdhip64.so`` in order to support the
+ existing applications. Gradually, these symbols will be removed from HIP
+ library and applications using HIPRTC will be required to explicitly link to
+ HIPRTC library. However, on Windows ``hiprtc.dll`` must be used as the
+ ``amdhip64.dll`` doesn't contain the HIPRTC symbols.
+
+* Data types such as ``uint32_t``, ``uint64_t``, ``int32_t``, ``int64_t``
+ defined in std namespace in HIPRTC are deprecated earlier and are being
+ removed from ROCm release 6.1 since these can conflict with the standard
+ C++ data types. These data types are now prefixed with ``__hip__``, for example
+ ``__hip_uint32_t``. Applications previously using ``std::uint32_t`` or similar
+ types can use ``__hip_`` prefixed types to avoid conflicts with standard std
+ namespace or application can have their own definitions for these types. Also,
+ type_traits templates previously defined in std namespace are moved to
+ ``__hip_internal`` namespace as implementation details.
diff --git a/docs/how-to/hip_runtime_api/asynchronous.rst b/docs/how-to/hip_runtime_api/asynchronous.rst
index 81769da48e..82c024969f 100644
--- a/docs/how-to/hip_runtime_api/asynchronous.rst
+++ b/docs/how-to/hip_runtime_api/asynchronous.rst
@@ -136,7 +136,7 @@ This overlap of computation and data transfer ensures that the GPU is not idle
while waiting for data. :cpp:func:`hipMemcpyPeerAsync` enables data transfers
between different GPUs, facilitating multi-GPU communication.
-:ref:`async_example`` include launching kernels in one stream while performing
+:ref:`async_example` include launching kernels in one stream while performing
data transfers in another. This technique is especially useful in applications
with large data sets that need to be processed quickly.
diff --git a/docs/how-to/hip_runtime_api/cooperative_groups.rst b/docs/how-to/hip_runtime_api/cooperative_groups.rst
index 3170e197ef..a3e32cd294 100644
--- a/docs/how-to/hip_runtime_api/cooperative_groups.rst
+++ b/docs/how-to/hip_runtime_api/cooperative_groups.rst
@@ -164,7 +164,7 @@ The ``thread_rank()`` , ``size()``, ``cg_type()``, ``is_valid()``, ``sync()``, `
Coalesced groups
------------------
-Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled, if they do not meet the condition to execute that branch. The active threads referred as coalesced, and coalesced group represents an active thread group within a warp.
+Threads (64 threads on CDNA and 32 threads on RDNA) in a warp cannot execute different instructions simultaneously, so conditional branches are executed serially within the warp. When threads encounter a conditional branch, they can diverge, resulting in some threads being disabled if they do not meet the condition to execute that branch. The active threads are referred to as coalesced, and coalesced group represents an active thread group within a warp.
.. note::
diff --git a/docs/how-to/hip_runtime_api/error_handling.rst b/docs/how-to/hip_runtime_api/error_handling.rst
index 575f9dee81..a400ff97ec 100644
--- a/docs/how-to/hip_runtime_api/error_handling.rst
+++ b/docs/how-to/hip_runtime_api/error_handling.rst
@@ -37,6 +37,8 @@ Best practices of HIP error handling:
For more details on the error handling functions, see :ref:`error handling
functions reference page `.
+For a list of all error codes, see :ref:`HIP error codes `.
+
.. _hip_check_macros:
HIP check macros
diff --git a/docs/how-to/hip_runtime_api/memory_management/device_memory.rst b/docs/how-to/hip_runtime_api/memory_management/device_memory.rst
index 13fba386bb..54651a3f9f 100644
--- a/docs/how-to/hip_runtime_api/memory_management/device_memory.rst
+++ b/docs/how-to/hip_runtime_api/memory_management/device_memory.rst
@@ -69,34 +69,34 @@ better option, but is also limited in size.
.. code-block:: cpp
__global__ void kernel_memory_allocation(TYPE* pointer){
- // The pointer is stored in shared memory, so that all
- // threads of the block can access the pointer
- __shared__ int *memory;
-
- size_t blockSize = blockDim.x;
- constexpr size_t elementsPerThread = 1024;
- if(threadIdx.x == 0){
- // allocate memory in one contiguous block
- memory = new int[blockDim.x * elementsPerThread];
- }
- __syncthreads();
+ // The pointer is stored in shared memory, so that all
+ // threads of the block can access the pointer
+ __shared__ int *memory;
+
+ size_t blockSize = blockDim.x;
+ constexpr size_t elementsPerThread = 1024;
+ if(threadIdx.x == 0){
+ // allocate memory in one contiguous block
+ memory = new int[blockDim.x * elementsPerThread];
+ }
+ __syncthreads();
- // load pointer into thread-local variable to avoid
- // unnecessary accesses to shared memory
- int *localPtr = memory;
+ // load pointer into thread-local variable to avoid
+ // unnecessary accesses to shared memory
+ int *localPtr = memory;
- // work with allocated memory, e.g. initialization
- for(int i = 0; i < elementsPerThread; ++i){
- // access in a contiguous way
- localPtr[i * blockSize + threadIdx.x] = i;
- }
+ // work with allocated memory, e.g. initialization
+ for(int i = 0; i < elementsPerThread; ++i){
+ // access in a contiguous way
+ localPtr[i * blockSize + threadIdx.x] = i;
+ }
- // synchronize to make sure no thread is accessing the memory before freeing
- __syncthreads();
- if(threadIdx.x == 0){
- delete[] memory;
+ // synchronize to make sure no thread is accessing the memory before freeing
+ __syncthreads();
+ if(threadIdx.x == 0){
+ delete[] memory;
+ }
}
-}
Copying between device and host
--------------------------------------------------------------------------------
diff --git a/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst b/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst
index 91f951b296..b771b8c902 100644
--- a/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst
+++ b/docs/how-to/hip_runtime_api/memory_management/virtual_memory.rst
@@ -25,6 +25,10 @@ issue of reallocation when the extra buffer runs out.
Virtual memory management solves these memory management problems. It helps to
reduce memory usage and unnecessary ``memcpy`` calls.
+HIP virtual memory management is built on top of HSA, which provides low-level
+access to AMD GPU memory. For more details on the underlying HSA runtime,
+see :doc:`ROCr documentation `
+
.. _memory_allocation_virtual_memory:
Memory allocation
diff --git a/docs/how-to/logging.rst b/docs/how-to/logging.rst
index ecf40fa192..3c8b8c5a53 100644
--- a/docs/how-to/logging.rst
+++ b/docs/how-to/logging.rst
@@ -240,3 +240,16 @@ information when calling the backend runtime.
:3:C:\constructicon\builds\gfx\two\22.40\drivers\compute\hipamd\src\hip_memory.cpp:681 : 605414524092 us: 29864: [tid:0x9298] hipMemGetInfo: Returned hipSuccess :
memInfo.total: 12.06 GB
memInfo.free: 11.93 GB (99%)
+
+Logging hipcc commands
+================================================================================
+
+To see the detailed commands that hipcc issues, set the environment variable
+``HIPCC_VERBOSE``. Doing so will print the HIP-clang (or NVCC) commands that
+hipcc generates to ``stderr``.
+
+.. code-block:: shell
+
+ export HIPCC_VERBOSE=1
+ hipcc main.cpp
+ hipcc-cmd: /opt/rocm/lib/llvm/bin/clang++ --offload-arch=gfx90a --driver-mode=g++ -O3 --hip-link -x hip main.cpp
diff --git a/docs/index.md b/docs/index.md
index 7678aaae79..d47962d6fe 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -22,7 +22,6 @@ The HIP documentation is organized into the following categories:
:::{grid-item-card} Programming guide
-* [Introduction](./programming_guide)
* {doc}`./understand/programming_model`
* {doc}`./understand/hardware_implementation`
* {doc}`./understand/compilers`
@@ -42,12 +41,13 @@ The HIP documentation is organized into the following categories:
:::{grid-item-card} Reference
* [HIP runtime API](./reference/hip_runtime_api_reference)
-* [HSA runtime API for ROCm](./reference/virtual_rocr)
* [HIP math API](./reference/math_api)
+* [HIP complex math API](./reference/complex_math_api)
* [HIP environment variables](./reference/env_variables)
+* [HIP error codes](./reference/error_codes)
* [CUDA to HIP API Function Comparison](./reference/api_syntax)
* [List of deprecated APIs](./reference/deprecated_api_list)
-* [FP8 numbers in HIP](./reference/fp8_numbers)
+* [Low Precision Floating Point Types](./reference/low_fp_types)
* {doc}`./reference/hardware_features`
:::
diff --git a/docs/install/build.rst b/docs/install/build.rst
index 64deba241b..b0a7baa43d 100644
--- a/docs/install/build.rst
+++ b/docs/install/build.rst
@@ -9,27 +9,28 @@ Build HIP from source
Prerequisites
=================================================
-HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler, or a CUDA platform with ``nvcc`` installed.
-Before building and running HIP, make sure drivers and prebuilt packages are installed properly on the platform.
+HIP code can be developed either on AMD ROCm platform using HIP-Clang compiler,
+or a CUDA platform with ``nvcc`` installed. Before building and running HIP,
+make sure drivers and prebuilt packages are installed properly on the platform.
You also need to install Python 3, which includes the ``CppHeaderParser`` package.
Install Python 3 using the following command:
.. code-block:: shell
- apt-get install python3
+ apt-get install python3
Check and install ``CppHeaderParser`` package using the command:
.. code-block:: shell
- pip3 install CppHeaderParser
+ pip3 install CppHeaderParser
Install ``ROCm LLVM`` package using the command:
.. code-block:: shell
- apt-get install rocm-llvm-dev
+ apt-get install rocm-llvm-dev
.. _Building the HIP runtime:
@@ -41,201 +42,209 @@ Set the repository branch using the variable: ``ROCM_BRANCH``. For example, for
.. code-block:: shell
- export ROCM_BRANCH=rocm-6.1.x
+ export ROCM_BRANCH=rocm-6.1.x
.. tab-set::
- .. tab-item:: AMD
- :sync: amd
+ .. tab-item:: AMD
+ :sync: amd
- #. Get HIP source code.
+ #. Get HIP source code.
- .. note::
- Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and
- OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on.
+ .. note::
+
+ Starting in ROCM 5.6, CLR is a new repository that includes the former ROCclr, HIPAMD and
+ OpenCl repositories. OpenCL provides headers that ROCclr runtime depends on.
- .. note::
- Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP.
- ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms,
- like NVIDIA.
+ .. note::
- .. code-block:: shell
+ Starting in ROCM 6.1, a new repository ``hipother`` is added to ROCm, which is branched out from HIP.
+ ``hipother`` provides files required to support the HIP back-end implementation on some non-AMD platforms,
+ like NVIDIA.
- git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
- git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
+ .. code-block:: shell
- CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL.
+ git clone -b "$ROCM_BRANCH" https://github.com/ROCm/clr.git
+ git clone -b "$ROCM_BRANCH" https://github.com/ROCm/hip.git
- ROCclr (ROCm Compute Language Runtime) is a virtual device interface which
- is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends.
+ CLR (Compute Language Runtime) repository includes ROCclr, HIPAMD and OpenCL.
- HIPAMD provides implementation specifically for HIP on the AMD platform.
+ ROCclr (ROCm Compute Language Runtime) is a virtual device interface which
+ is defined on the AMD platform. HIP runtime uses ROCclr to interact with different backends.
- OpenCL provides headers that ROCclr runtime currently depends on.
- hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA.
+ HIPAMD provides implementation specifically for HIP on the AMD platform.
- #. Set the environment variables.
+ OpenCL provides headers that ROCclr runtime currently depends on.
+ hipother provides headers and implementation specifically for non-AMD HIP platforms, like NVIDIA.
- .. code-block:: shell
+ #. Set the environment variables.
- export CLR_DIR="$(readlink -f clr)"
- export HIP_DIR="$(readlink -f hip)"
+ .. code-block:: shell
+ export CLR_DIR="$(readlink -f clr)"
+ export HIP_DIR="$(readlink -f hip)"
- #. Build HIP.
- .. code-block:: shell
+ #. Build HIP.
- cd "$CLR_DIR"
- mkdir -p build; cd build
- cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
+ .. code-block:: shell
- make -j$(nproc)
- sudo make install
+ cd "$CLR_DIR"
+ mkdir -p build; cd build
+ cmake -DHIP_COMMON_DIR=$HIP_DIR -DHIP_PLATFORM=amd -DCMAKE_PREFIX_PATH="/opt/rocm/" -DCMAKE_INSTALL_PREFIX=$PWD/install -DHIP_CATCH_TEST=0 -DCLR_BUILD_HIP=ON -DCLR_BUILD_OCL=OFF ..
- .. note::
+ make -j$(nproc)
+ sudo make install
- Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at
- ````.
+ .. note::
- By default, release version of HIP is built. If need debug version, you can put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.
+ Note, if you don't specify ``CMAKE_INSTALL_PREFIX``, the HIP runtime is installed at
+ ````.
- Default paths and environment variables:
+ By default, release version of HIP is built. If need debug version, you can
+ put the option ``CMAKE_BUILD_TYPE=Debug`` in the command line.
- * HIP is installed into ````. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.
- environment variable.
- * HSA is in ````. This can be overridden by setting the ``HSA_PATH``
- environment variable.
- * Clang is in ``/llvm/bin``. This can be overridden by setting the
- ``HIP_CLANG_PATH`` environment variable.
- * The device library is in ``/lib``. This can be overridden by setting the
- ``DEVICE_LIB_PATH`` environment variable.
- * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to
- use the tools.
- * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
+ Default paths and environment variables:
- After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.
+ * HIP is installed into ````. This can be overridden by setting the ``INSTALL_PREFIX`` as the command option.
+
+ * HSA is in ````. This can be overridden by setting the ``HSA_PATH`` environment variable.
+
+ * Clang is in ``/llvm/bin``. This can be overridden by setting the ``HIP_CLANG_PATH`` environment variable.
+
+ * The device library is in ``/lib``. This can be overridden by setting the ``DEVICE_LIB_PATH`` environment variable.
+
+ * Optionally, you can add ``/bin`` to your ``PATH``, which can make it easier to use the tools.
+
+ * Optionally, you can set ``HIPCC_VERBOSE=7`` to output the command line for compilation.
- #. Generate a profiling header after adding/changing a HIP API.
+ After you run the ``make install`` command, HIP is installed to ```` by default, or ``$PWD/install/hip`` while ``INSTALL_PREFIX`` is defined.
- When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
- This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.
+ #. Generate a profiling header after adding/changing a HIP API.
- To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
- ``hipamd/src``.
+ When you add or change a HIP API, you may need to generate a new ``hip_prof_str.h`` header.
+ This header is used by ROCm tools to track HIP APIs, such as ``rocprofiler`` and ``roctracer``.
- Usage:
+ To generate the header after your change, use the ``hip_prof_gen.py`` tool located in
+ ``hipamd/src``.
- .. code-block:: shell
+ Usage:
- `hip_prof_gen.py [-v] [