Kernel Database Technical Reference: rocBLAS and hipBLASLt

Complete File Listing and Patterns

hipBLASLt gfx1100 Library Sample Files

Total: 304 files, 42 MB

Category 1: Main Kernels (4 files)

-rw-r--r--  1.4M  hipblasltTransform.hsaco      # Shared matrix transform kernels
-rw-r--r--  4.3K  hipblasltExtOpLibrary.dat     # Extended operations metadata
-rw-r--r--  3.7M  Kernels.so-000-gfx1100.hsaco # Architecture-specific kernel archive
-rw-r--r--   41K  extop_gfx1100.co              # Extended operations (LayerNorm, Softmax, AMax)

Category 2: Tensor Contraction Kernels (300 files, organized by configuration)

Each has a paired .co (compiled) and .dat (metadata) file:

# Example family: Type BB HPA Contraction with various parameter layouts
TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.co    [138K]
TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.dat   [201K]

TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx1100.co    [208K]
TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_gfx1100.dat   [284K]

TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_gfx1100.co    [244K]
TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Alik_Bjlk_Cijk_Dijk_gfx1100.dat   [284K]

[... approximately 150 additional pairs with different configurations ...]

rocBLAS gfx1100 Library Sample Files

Total: 439 files, 84 MB

Category 1: Core Kernels (4 files)

-rw-r--r--  262K  Kernels.so-000-gfx1100.hsaco
-rw-r--r--   24K  TensileLibrary_lazy_gfx1100.dat
-rw-r--r--  42K   TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback.dat
-rw-r--r--  195K  TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback_gfx1100.hsaco

Category 2: Type-Specific Variants with Fallbacks

# Int8 Variants
TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback.dat
TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback_gfx1100.hsaco [195K]
TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback.dat
TensileLibrary_Type_4xi8I_HPA_Contraction_l_Ailk_Bljk_Cijk_Dijk_fallback_gfx1100.hsaco [163K]

# BFloat16 Variants (no .co files for fallbacks)
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback.dat
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback_gfx1100.hsaco [590K]
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.co              [323K]
TensileLibrary_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.dat             [519K]

[... approximately 100+ additional variants ...]

File Format Specifications

HSACO Format Details

File Type: ELF 64-bit LSB shared object (AMD GPU binary)

Example header analysis:

7f 45 4c 46        # ELF magic number
02                 # 64-bit
01                 # Little endian
01                 # System V ABI

Usage:

Contains compiled GPU machine code
Can be loaded directly by HIP runtime
Multiple HSACO files may be concatenated in some cases
BuildID included for versioning

Size Pattern:

hipBLASLt kernels: typically 3.7 MB for main kernel archive
rocBLAS kernels: typically 160 KB - 590 KB for specific implementations
Fallback implementations: often larger, ~500+ KB

CO Format Details

File Type: ELF 64-bit LSB shared object (AMD GPU architecture specific)

Usage:

Individual compiled kernel objects
Smaller, more granular than HSACO
Loaded together with corresponding .dat metadata

Size Pattern:

extop kernels: 40-50 KB
Contraction kernels: 100-600 KB
Typically smaller than fallback HSACO files

Naming Convention:

<operation>_<architecture>.co

DAT Format Details

File Type: MessagePack binary serialization

Hexdump Analysis:

Offset: 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f
00000: 83 a7 67 66 78 31 31 30 30 83 a9 4c 61 79 65 72
       ^^ - msgpack fixmap (3 entries)
          ^^^^^^^^^^^^^^^^^^ - key: "gfx1100" (7 bytes)
                             ^^^^^^^^^^^^^^^^^ - value: another msgpack object starting with 83 (3 entries)

Content Structure:

Top-level keys (examples from hipblasltExtOpLibrary.dat):
- "gfx1100", "gfx1101", "gfx1103" - per-architecture operation definitions
Per-architecture entries contain:
- "LayerNorm" - operation name
- "Softmax", "AMax", "MatrixTransform" - other operations
Per-operation entries contain:
- "func_name" - kernel function identifier
- "io_type" - input/output data type
- "limit" - work item limits
- "num_workitems" - work group configuration
- "co_path" - path to corresponding .co file

Example Structure:

{
  "gfx1100": {
    "LayerNorm": [
      {
        "func_name": "LayerNorm_DT_S_W_256_C_4_S_1",
        "io_type": "S",
        "limit": 4096,
        "num_workitems": 256,
        "co_path": "extop_gfx1100.co"
      },
      // ... more parameter variants
    ]
  }
}

Architecture-Specific Variants

hipBLASLt Supported Architectures

Located in /develop/therock/rocm-libraries/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/asm_full/:

aldebaran - CDNA 2 architecture
- 110CU variant
- 104CU variant
aquavanjaram - CDNA 3 architecture
- gfx942 base
- gfx942_20cu
- gfx942_38cu
- gfx942_64cu
- gfx942_80cu
- gfx942_152cu
- gfx942_228cu
RDNA Series
- gfx1150 - RDNA 2
- gfx1200 - RDNA 3
- gfx1201 - RDNA 3 variant
- gfx1103 - RDNA 3 variant
- gfx950 - RDNA 4 (with Origami, StreamK variants)
- navi31 - RDNA 3
- navi32 - RDNA 3
- navi33 - RDNA 3

rocBLAS Supported Architectures

Same pattern but installed versions typically include:

gfx1100, gfx1101, gfx1102, gfx1103 (RDNA 3 variants)
Architecture selection is architecture-specific and cannot be easily shared

Kernel Naming Component Breakdown

hipBLASLt Contraction Kernel Names

Pattern:

TensileLibrary_<datatype_spec>_<feature_flags>_Type_<precision>_Contraction_l_<memory_layout>_<gfx>.ext

Example:

TensileLibrary_BB_BB_HA_Bias_Aux_SAV_UA_Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_gfx1100.co
         ^^^^^^ ^^^^^^ ^^ ^^^^ ^^^ ^^^^^ ^^^^^^^ ^^^^^^^^^^^ ^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^^^^
         Type   In-Out Feature Flags Prec Layout Impl      L layout                 Arch

Component Meanings:

Input/Output Types: BB_BB_HA
- First BB: Matrix A input type
- Second BB: Matrix B input type
- HA: Half or Auxiliary type for output
- Possible values: BB (bfloat16), FP (float), HF (half), etc.
Feature Flags: Bias_Aux_SAV_UA
- Bias - Bias addition support
- Aux - Auxiliary buffer support
- SAV - Scaled Alpha Value
- UA - Unscaled Alpha
- Combined indicate kernel capabilities
Precision Type: Type_BB_HPA
- BB - bfloat16 type
- HPA - High Precision Accumulation
Memory Layout: l_Ailk_Bjlk_Cijk_Dijk
- Format: l_<MatrixA>_<MatrixB>_<MatrixC>_<MatrixD>
- Ailk - A uses i,l,k dimensions
- Bjlk - B uses j,l,k dimensions
- Cijk - C uses i,j,k dimensions
- Dijk - D uses i,j,k dimensions
- Variants: Bljk (transposed B), Alik (different A layout)

rocBLAS Type-Specific Naming

More compact pattern:

TensileLibrary_Type_<type_code>_<contraction>_<variant_suffix>_<gfx>.ext

Examples:

Type_4xi8I_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk_fallback - Int8 with fallback
Type_BB_HPA_Contraction_l_Ailk_Bjlk_Cijk_Dijk - Standard bfloat16 (no fallback suffix = optimized)

Runtime Loading and Selection

DAT File Role in Runtime

The .dat files serve as metadata indices:

Architecture Discovery
- Runtime reads .dat to identify available architectures
- Matches against detected GPU architecture
Kernel Selection
- For each operation, .dat contains parameter variants
- Runtime selects appropriate variant based on problem size
- Example: LayerNorm_DT_S_W_256_C_4 for specific weight/channel dimensions
Path Resolution
- .dat specifies which .co file implements each variant
- Runtime loads corresponding compiled kernel
Lazy Loading
- rocBLAS uses "lazy" loading indicated by TensileLibrary_lazy_.dat
- Kernels loaded on-demand rather than all at initialization

HSACO File Role in Runtime

Fallback Implementation
- Used when optimized .co implementation unavailable
- rocBLAS includes explicit fallback variants with fallback suffix
Comprehensive Kernel Archive
- hipBLASLt main Kernels.so file contains broad kernel coverage
- Supplements specific Tensile libraries
Direct Loading
- Can be loaded as complete shared object
- No metadata file needed (used for certain operations)

Size and Storage Implications

Space Usage by Category

hipBLASLt gfx1100 Breakdown

Kernels.so-000-gfx1100.hsaco    3.7 MB   (9%)
hipblasltTransform.hsaco         1.4 MB   (3%)
Tensile .dat files (151 total)   ~20 MB   (48%)
Tensile .co files (149 total)    ~18 MB   (43%)
Other metadata                   ~2 MB    (5%)
Total:                           42 MB

rocBLAS gfx1100 Breakdown

HSACO files (220 total)          ~75 MB   (89%)
DAT files (138 total)            ~7 MB    (8%)
CO files (80 total)              ~2 MB    (2%)
Lazy loading metadata            ~1 MB    (1%)
Total:                           84 MB

Scaling with Architecture Count

Adding each additional architecture:

hipBLASLt: +42 MB per architecture
rocBLAS: +84 MB per architecture

For 5 architectures (gfx1100, 1101, 1102, 1103, 1201):

hipBLASLt: 210 MB total
rocBLAS: 420 MB total
Combined: 630 MB for kernel databases alone

Parsing DAT Files Programmatically

Python Example

import msgpack

# Load DAT file
with open('hipblasltExtOpLibrary.dat', 'rb') as f:
    data = msgpack.unpackb(f.read(), raw=False)

# Iterate through architectures
for arch, operations in data.items():
    print(f"Architecture: {arch}")
    for op_name, variants in operations.items():
        print(f"  Operation: {op_name}")
        for variant in variants:
            print(f"    - {variant['func_name']}: {variant['co_path']}")

Key Fields to Extract

From each variant entry:

func_name - Unique kernel identifier
io_type - Data type (S=single, H=half, D=double)
limit - Work item limit
num_workitems - Workgroup size
co_path - Relative path to kernel binary
is_scale - Boolean scale operation support

Splitting Strategy for kpack

Architecture-Segregated Splitting

Recommended approach for separate kpack files:

rocm-hipblaslt-gfx1100-kernels.kpack
  /lib/hipblaslt/library/
    - Kernels.so-000-gfx1100.hsaco
    - extop_gfx1100.co
    - TensileLibrary_*_gfx1100.co (149 files)
    - TensileLibrary_*_gfx1100.dat (151 files)

rocm-hipblaslt-gfx1101-kernels.kpack
  /lib/hipblaslt/library/
    - Kernels.so-000-gfx1101.hsaco
    - extop_gfx1101.co
    - TensileLibrary_*_gfx1101.co (149 files)
    - TensileLibrary_*_gfx1101.dat (151 files)

rocm-hipblaslt-gfx1103-kernels.kpack
  /lib/hipblaslt/library/
    - (same pattern)

rocm-hipblaslt-shared.kpack
  /lib/hipblaslt/library/
    - hipblasltTransform.hsaco
    - hipblasltExtOpLibrary.dat

Size Reduction for gfx1201

Note: gfx1201 is significantly larger (320 MB vs 42 MB):

Option 1: Include only gfx1201-specific files (~75 MB)
Option 2: Use sparse optimization selection
Option 3: Compress entire library within kpack

Dependency Chains

When splitting, maintain:

.co files paired with corresponding .dat files
Architecture markers in filenames enable runtime selection
No file dependencies between architectures (clean separation)

Potential Optimization Opportunities

1. Compression Within kpack

Both .co and .hsaco files compress well (ELF format):

Typical compression ratio: 40-50%
hipBLASLt gfx1100: 42 MB -> 21-25 MB compressed
rocBLAS gfx1100: 84 MB -> 42-50 MB compressed

2. Archive Format Consolidation

Instead of 300+ individual files:

Create tar.gz per kernel family
Extract at installation time
Reduces filesystem overhead

3. Lazy Architecture Detection

Ship architecture discovery script
Identify target GPU at installation
Unpack only necessary architecture

4. Shared DAT Consolidation

hipblasltExtOpLibrary.dat is truly shared
rocBLAS lazy loading can reference generic fallbacks
Reduces redundancy in metadata

Testing and Validation

File Integrity Checks

# Verify ELF format
file /lib/hipblaslt/library/Kernels.so-000-gfx1100.hsaco

# Verify msgpack format
python3 -c "import msgpack; msgpack.unpackb(open('file.dat', 'rb').read())"

# Check architecture in filename
ls /lib/hipblaslt/library/ | grep -o 'gfx[0-9]*' | sort -u

Runtime Validation

After unpacking kpack:

Check file count matches expected
Verify all paired .co/.dat files present
Validate HSACO ELF headers
Confirm MessagePack .dat files parseable
Test kernel loading with HIP runtime

References

Installed hipBLASLt library: /home/stella/claude-rocm-workspace/rocm/gfx1100/lib/hipblaslt/library/ (42 MB, 304 files)
Installed rocBLAS library: /home/stella/claude-rocm-workspace/rocm/gfx1100/lib/rocblas/library/ (84 MB, 439 files)
Source Tensile Logic: /develop/therock/rocm-libraries/projects/hipblaslt/library/src/amd_detail/rocblaslt/src/Tensile/Logic/
MessagePack specification: https://msgpack.org/
AMD GPU Binary Format (HSACO): ROCm documentation

FilesExpand file tree

kernel-databases-technical-reference.md

Latest commit

History