rocBLAS and hipBLASLt maintain large kernel database libraries organized by GPU architecture. These are installed in library directories as collections of compiled kernel objects (.co), kernel metadata (.dat), and kernel archives (.hsaco). The databases contain pre-compiled kernels from Tensile and support for extended operations (extop).
-
gfx1100: 42 MB (304 files)
- 149 .co files
- 151 .dat files
- 4 .hsaco files
-
gfx1201: 320 MB (590 files)
- Proportionally larger due to more optimization variants
-
gfx1100: 84 MB (439 files)
- 80 .co files
- 138 .dat files
- 220 .hsaco files
- 1 .txt file
-
gfx1201: 21 MB
- Smaller than gfx1100 due to targeted optimization
/rocm/<arch>/lib/hipblaslt/library/
- Kernels.so-000-<arch>.hsaco [3.7 MB each, arch-specific]
- hipblasltTransform.hsaco [1.4 MB, shared]
- hipblasltExtOpLibrary.dat [4.3 KB, shared metadata]
- extop_<arch>.co [41 KB each, arch-specific]
- TensileLibrary_*.co [100 KB - 250 KB, arch-specific]
- TensileLibrary_*.dat [200 KB - 600 KB, arch-specific]
Example paths:
/home/stella/claude-rocm-workspace/rocm/gfx1100/lib/hipblaslt/library//home/stella/claude-rocm-workspace/rocm/gfx1201/lib/hipblaslt/library/
/rocm/<arch>/lib/rocblas/library/
- Kernels.so-000-<arch>.hsaco [262 KB each]
- TensileLibrary_lazy_<arch>.dat [24 KB each]
- TensileLibrary_Type_*_fallback_<arch>.hsaco [160 KB - 590 KB]
- TensileLibrary_Type_*.co [150 KB - 590 KB]
- TensileLibrary_Type_*.dat [24 KB - 610 KB]
Example paths:
/home/stella/claude-rocm-workspace/rocm/gfx1100/lib/rocblas/library//home/stella/claude-rocm-workspace/rocm/gfx1201/lib/rocblas/library/
- Type: ELF 64-bit LSB shared object (AMD GPU architecture)
- Purpose: Compiled kernel archive containing GPU machine code
- File Format: Binary ELF format with AMD GPU metadata
- Typical Size: 160 KB - 3.7 MB
- Example:
Kernels.so-000-gfx1100.hsaco(3.7 MB in hipBLASLt) - Example:
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback_gfx1100.hsaco
- Type: ELF 64-bit LSB shared object (AMD GPU architecture)
- Purpose: Compiled kernel object for specific operations
- File Format: Binary ELF format
- Typical Size: 40 KB - 590 KB
- Example:
extop_gfx1100.co(41 KB - extended operations) - Example:
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.co(323 KB)
- Type: MessagePack (msgpack) binary format with metadata
- Purpose: Kernel metadata/index database
- File Format: Binary serialized data structure (not plain text)
- Typical Size: 4 KB - 610 KB
- Content Example: Hexdump shows:
Indicates architecture (gfx1100), operation names (LayerNorm), function names, I/O types, work item limits, and paths to .co files
83 a7 67 66 78 31 31 30 30 ..gfx1100 83 a9 4c 61 79 65 72 4e ..LayerNorm
- Contains metadata for extended operations (LayerNorm, AMax, Softmax, etc.)
- References architecture-specific .co files
- Format: MessagePack serialized dictionary with operation definitions
Both libraries organize kernels by GPU architecture with clear naming conventions:
- gfx1100 (RDNA 3)
- gfx1101 (RDNA 3)
- gfx1103 (RDNA 3)
- gfx1100 (RDNA 3)
- gfx1101 (RDNA 3)
- gfx1102 (RDNA 3)
- gfx1103 (RDNA 3)
All architecture-specific files follow the pattern:
<operation>_<architecture>.ext
TensileLibrary_<type>_<arch>.ext
Examples:
Kernels.so-000-gfx1100.hsaco- Kernel archive for gfx1100TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.coTensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1101.dat
Kernels.so-000-<gfx>.hsaco- Main kernel archivehipblasltTransform.hsaco- Matrix transform kernels (shared across architectures)
extop_<gfx>.co- Extended operation kernels (LayerNorm, Softmax, AMax, etc.)hipblasltExtOpLibrary.dat- Metadata index for extended operations
Naming pattern: TensileLibrary_<datatype>_<contraction_spec>_<gfx>.<ext>
Example components:
BB_BB_HA- Data type specifications (bfloat16, bfloat16, half)Bias_Aux_SAV_UA- Feature flags (bias, auxiliary, scaled alpha/beta, unscaled alpha)Type_BB_HPA- Precision specificationContraction_l_Ailk_Bjlk_Cijk_Dijk- Contraction layout specificationAilk= Matrix A layoutBjlk= Matrix B layoutCijk= Matrix C layoutDijk= Matrix D (output) layout
Each configuration has:
.cofile (compiled kernel) - ~140-250 KB.datfile (metadata) - ~200-300 KB
Kernels.so-000-<gfx>.hsaco- Main kernel archive (~262 KB per arch)
TensileLibrary_lazy_<gfx>.dat- Lazy loading metadata (~24 KB per arch)
Pattern: TensileLibrary_Type_<spec>_<gfx>.<ext>
Examples with variants:
-
4xi8I Contraction (int8 data types)
- Fallback variants:
_fallback_<gfx>.hsaco - Optimized variants:
_<gfx>.coand_<gfx>.dat
- Fallback variants:
-
BB Contraction (bfloat16 or similar)
- Fallback variants:
_fallback_<gfx>.hsaco - Architecture-specific:
_<gfx>.coand_<gfx>.dat - File sizes vary: .co (150-600 KB), .dat (230-610 KB)
- Fallback variants:
Similar to hipBLASLt:
Ailk_Bjlk_Cijk_Dijk- Standard row-major contractionAilk_Bljk_Cijk_Dijk- Variant with transposed BAlik_Bjlk_Cijk_Dijk- Variant with different A layoutAlik_Bljk_Cijk_Dijk- Combined layout variant
/develop/therock/rocm-libraries/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/
asm_full/
aldebaran/
gfx1100/
gfx1201/
gfx1150/
gfx1201/
aquavanjaram/
gfx942/
gfx942_20cu/
gfx942_38cu/
gfx942_64cu/
gfx942_80cu/
gfx942_152cu/
gfx942_228cu/
navi31/
navi32/
navi33/
gfx950/
Each architecture directory contains:
- GridBased/ - Grid-based kernel implementations
- Equality/ - Equality-based implementations
- Origami/ - Origami optimization variants (gfx950)
- StreamK/ - StreamK implementation (gfx942)
/develop/therock/rocm-libraries/projects/rocblas/library/src/
blas1/ - Level 1 BLAS (vector operations)
blas2/ - Level 2 BLAS (matrix-vector operations)
blas3/ - Level 3 BLAS (matrix-matrix operations)
Tensile/
Logic/ - Architecture-specific kernel logic
blas_ex/ - Extended precision operations
-
Architecture Isolation
- Each GPU architecture (gfx1100, gfx1101, etc.) has separate .co and .dat files
- .hsaco files may be combined or separate per architecture
- Shared files (like hipblasltTransform.hsaco) minimize duplication
-
Paired Files (.co + .dat)
- Most Tensile kernels are paired:
name.co(kernel code) +name.dat(metadata) - The .dat file contains configuration and references to the .co file
- Allows lazy loading and kernel selection at runtime
- Most Tensile kernels are paired:
-
Scalability
- hipBLASLt: 304 files per architecture (42 MB)
- rocBLAS: 439 files per architecture (84 MB)
- Multiple architectures mean files multiply: 2 architectures = ~600-800 files total
-
File Format Consistency
- HSACO: Standard ELF format for compiled GPU code
- CO: ELF format for individual kernel objects
- DAT: MessagePack serialized configuration format
- Kernel databases are architecture-specific and cannot be easily shared
- Installation includes all variants regardless of target architecture
- No runtime filtering for target architecture
- Separate by Architecture: Each architecture (gfx1100, gfx1201, etc.) should have separate kernel packages
- Handle Paired Files: When splitting, keep .co and .dat files paired
- Shared Files: Identify genuinely shared files (hipblasltTransform.hsaco) and handle separately
- Compression: With hundreds of similar-sized files, consider archive-based storage to reduce overhead
- Metadata Extraction: Parse .dat files (MessagePack) to understand kernel dependencies and operation mappings
- Single architecture hipBLASLt: ~14 MB per architecture (42 MB / 3 architectures)
- Single architecture rocBLAS: ~21 MB per architecture (84 MB / 4 architectures)
- Shared overhead: ~1-2 MB per library
- hipBLASLt Library CMakeLists:
/develop/therock/rocm-libraries/projects/hipblaslt/library/ - rocBLAS Library CMakeLists:
/develop/therock/rocm-libraries/projects/rocblas/library/ - Tensile Logic:
/develop/therock/rocm-libraries/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/
- hipBLASLt:
/home/stella/claude-rocm-workspace/rocm/<gfx>/lib/hipblaslt/library/ - rocBLAS:
/home/stella/claude-rocm-workspace/rocm/<gfx>/lib/rocblas/library/
- HSACO: ELF 64-bit LSB shared object (AMD GPU binary format)
- CO: ELF 64-bit LSB shared object (individual kernel)
- DAT: MessagePack binary serialization format