When to use: Segmentation faults, GPU kernel failures, assertion errors, crashes during execution
Based on: Analysis of runtime failures including flaky multi-node issues and kernel selection problems
Runtime errors occur AFTER successful compilation:
- ✅ Code compiled successfully
- ✅ Binary/library exists
- ❌ Crashes/errors during execution

NOT runtime errors (different checklist):
- Build failures → Use checklist_build_failures.md
- CMake configuration issues → Use checklist_build_failures.md
- Import errors (Python) → May be a build or environment issue
Start here to determine your path:
1. Can you reproduce the error?
   ├─ YES, 100% of the time → Go to Section 3 (Versions)
   └─ NO, it's flaky/random → Go to Section 2 (Flaky Errors)
2. Do you have a last known good build/commit?
   ├─ YES → Go to Section 4 (Regression Analysis) FIRST
   └─ NO → Continue to Section 3
3. Is this a recent issue or long-standing?
   ├─ Recent (last 2 weeks) → Section 4 (Regression)
   └─ Long-standing → Section 3 (Versions), then Section 5 (Debug)
4. Ready to file the issue?
   → Go to Section 7 (Issue Requirements Summary)
- [ ] Segmentation Fault (SIGSEGV)
  - Example: "Segmentation fault (core dumped)"
  - Example: "Fatal Python error: Segmentation fault"
- [ ] GPU Kernel Error
  - Example: "hipErrorLaunchFailure: Unspecified launch failure"
  - Example: "HSA_STATUS_ERROR_MEMORY_FAULT"
- [ ] Assertion Failure
  - Example: "Assertion `x != nullptr' failed"
- [ ] Memory Error
  - Example: "Out of memory"
  - Example: "hipErrorMemoryAllocation"
- [ ] Numeric/Accuracy Error
  - Example: "NaN detected in output"
Full stack trace (required):

Key information:
- Component: _________________ (hipblasLt, Tensile, rocBLAS, etc.)
- Function: _________________
- Kernel name (if GPU): _________________
Common stack trace patterns:

| Pattern | Likely Cause | Route To |
|---|---|---|
| `hipblasLtMatmulAlgoGetHeuristic` | Kernel selection issue | Math team |
| `TensileHost::solutionSelection` | Tensile library issue | Math team |
| `rcclAllReduce` | Communication layer | RCCL team |
| `hipMalloc` / `hipMemcpy` | Memory allocation | ROCm Runtime |
If the error is NOT 100% reproducible:

Reproduction rate: _____%

Test at least 20 times to establish a baseline:
```bash
#!/bin/bash
TOTAL=20
FAILED=0
for i in $(seq 1 $TOTAL); do
    ./run_test.sh > run_$i.log 2>&1 || ((FAILED++))
done
echo "Failure rate: $((FAILED * 100 / TOTAL))%"
```
Minimum to file an issue:
- Failure rate >10% (if <10%, too unreliable to debug)
- At least 3 failure logs captured
- At least 3 success logs captured (for comparison)
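A variant of the baseline script that also keeps the first three failure and success logs (the minimums listed above). Here `flaky_test` is a mock standing in for `./run_test.sh` so the sketch is runnable anywhere; it fails on every 4th run, giving a deterministic 25% rate:

```bash
#!/bin/bash
# Mock for ./run_test.sh: fails when the run index is divisible by 4.
flaky_test() { (( $1 % 4 != 0 )); }

TOTAL=20
FAILED=0
KEPT_FAIL=0
KEPT_OK=0
for i in $(seq 1 "$TOTAL"); do
  if flaky_test "$i" > "run_$i.log" 2>&1; then
    # Keep the first 3 success logs for comparison
    if (( KEPT_OK < 3 )); then KEPT_OK=$((KEPT_OK + 1)); cp "run_$i.log" "success_$KEPT_OK.log"; fi
  else
    FAILED=$((FAILED + 1))
    # Keep the first 3 failure logs
    if (( KEPT_FAIL < 3 )); then KEPT_FAIL=$((KEPT_FAIL + 1)); cp "run_$i.log" "failure_$KEPT_FAIL.log"; fi
  fi
done
echo "Failure rate: $((FAILED * 100 / TOTAL))%"
```

Swap the mock for your real test command; the log-keeping logic is unchanged.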
If failure rate increases with scale, complete this:
| Configuration | Runs | Failures | Rate |
|---|---|---|---|
| 1 node, 1 GPU | 20 | ___ | ___% |
| 1 node, 8 GPUs | 20 | ___ | ___% |
| 4 nodes | 20 | ___ | ___% |
| 16 nodes | 10 | ___ | ___% |
| 32+ nodes | 10 | ___ | ___% |
Pattern indicates:
- Rate increases with nodes → Race condition/communication issue
- Rate increases with GPUs → GPU synchronization issue
- Rate constant → Timing-independent bug
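On a single node, the GPU-count column of the matrix can be swept by narrowing `HIP_VISIBLE_DEVICES`. A runnable sketch follows, with `run_once` as a mock workload (its failure odds are contrived to grow with GPU count, mimicking a synchronization bug; replace it with your real test):

```bash
# Mock workload: fails more often as the GPU count grows.
run_once() { local gpus=$1 i=$2; (( (i * gpus) % 16 != 0 )); }

RUNS=20
for GPUS in 1 2 4 8; do
  # Expose only the first $GPUS devices, e.g. "0,1,2,3" for 4 GPUs.
  export HIP_VISIBLE_DEVICES=$(seq -s, 0 $((GPUS - 1)))
  FAILED=0
  for i in $(seq 1 "$RUNS"); do
    run_once "$GPUS" "$i" || FAILED=$((FAILED + 1))
  done
  echo "GPUs=$GPUS failures=$FAILED rate=$((FAILED * 100 / RUNS))%"
done
```

With the mock, the printed rate climbs from 5% at 1 GPU to 50% at 8 GPUs - the "rate increases with GPUs" signature above.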
When does it fail:
- During initialization
- During compilation (JIT/JAX)
- During kernel execution
- Random timing
Search the entire system for installations:
```bash
# Search everywhere for hipBLASLt
find / -name "libhipblaslt.so*" 2>/dev/null
find / -name "hipblaslt-version.h" 2>/dev/null
# Search for Tensile
find / -name "libtensile.so*" 2>/dev/null
find / -name "Tensile" -type d 2>/dev/null | head -20
# Search for rocBLAS
find / -name "librocblas.so*" 2>/dev/null
```
Check which libraries are actually being loaded:
```bash
# For Python workloads:
ldd $(which python3) | grep -E "rocm|hip|hsa|hipblaslt|rocblas|tensile"
# For hipblaslt-bench:
ldd /opt/rocm/bin/hipblaslt-bench | grep -E "hipblaslt|rocblas|tensile"
# For other binaries (adjust path as needed):
ldd /path/to/your/binary | grep -E "hipblaslt|rocblas|tensile"
```
Multiple installations found? [ ] Yes [ ] No
If multiple installations found:
- DO NOT PROCEED - Report findings in issue
- List all locations found
- Document which libraries are being loaded (from ldd output)
- Let Math team determine resolution
Required in issue if multiple found:
Locations found:
- /opt/rocm-6.0.0/lib/libhipblaslt.so
- /opt/rocm-7.0.2/lib/libhipblaslt.so
- /home/user/local/lib/libhipblaslt.so
Libraries being loaded (from ldd):
[paste ldd output]
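The evidence above can be bundled into a single attachable file. The sketch below narrows the `find` to /opt for speed and inspects python3 as the example binary - both are assumptions; widen the search paths and swap in whichever binary actually crashes:

```bash
# Bundle installation + loading evidence into one report to attach.
REPORT=installs_report.txt
{
  echo "== libhipblaslt copies under /opt =="
  find /opt -name "libhipblaslt.so*" 2>/dev/null
  echo "== ROCm-related libraries loaded by python3 =="
  ldd "$(command -v python3)" 2>/dev/null | grep -E "rocm|hip|hsa" || echo "(none matched)"
} > "$REPORT"
echo "Report written to $REPORT"
```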
ROCm Version:
```bash
cat /opt/rocm/.info/version
# Output: _______
```
hipBLASLt Version from package:
```bash
dpkg -l | grep hipblaslt    # Ubuntu/Debian
rpm -qa | grep hipblaslt    # RHEL/CentOS
# Output: _______
```
hipBLASLt Version from header:
```bash
grep HIPBLASLT_VERSION /opt/rocm/include/hipblaslt/hipblaslt-version.h
# Output:
# #define HIPBLASLT_VERSION_MAJOR _______
# #define HIPBLASLT_VERSION_MINOR _______
# #define HIPBLASLT_VERSION_PATCH _______
# #define HIPBLASLT_VERSION_TWEAK _______
```
If building from source:
```bash
cd /workspace/rocm-libraries
git log -1 --format="Commit: %H%nDate: %ci"
# Output: _______
```
Check for duplicate symbols and ABI breaks:
```bash
# Check rocBLAS for symbol conflicts
nm -D /opt/rocm/lib/librocblas.so | grep -E "hipblasLt|Tensile" > rocblas_symbols.txt
# Check hipSPARSELt for symbol conflicts
nm -D /opt/rocm/lib/libhipsparselt.so 2>/dev/null | grep -E "hipblasLt|Tensile" > hipsparselt_symbols.txt
# Check for ODR (One Definition Rule) violations: compare symbol NAMES only,
# not whole nm lines, or differing addresses will mask real duplicates
nm -D /opt/rocm/lib/lib*.so 2>/dev/null | awk '$2 == "T" {print $3}' | sort | uniq -d > duplicate_symbols.txt
# Check ABI/SONAME compatibility
readelf -d /opt/rocm/lib/libhipblaslt.so | grep SONAME
readelf -d /opt/rocm/lib/librocblas.so | grep SONAME
readelf -d /opt/rocm/lib/libhipsparselt.so 2>/dev/null | grep SONAME
```
Symbol conflicts or ODR violations found? [ ] Yes [ ] No
If yes, attach to issue:
- rocblas_symbols.txt
- hipsparselt_symbols.txt
- duplicate_symbols.txt
- Output of all readelf commands
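What the `uniq -d` step flags is the same exported "T" (text) symbol defined by more than one library. A canned demonstration of the check's core idea, with fabricated nm-style lines (the symbol names are made up):

```bash
# Two addresses, same exported "T" symbol name: that is the
# duplicate-definition case the ODR check hunts for.
DUPES=$(awk '$2 == "T" {print $3}' <<'EOF' | sort | uniq -d
0000000000001000 T tensileGetSolution
0000000000002000 T tensileGetSolution
0000000000003000 T fooOnlyHelper
EOF
)
echo "$DUPES"
```

Only `tensileGetSolution` is printed; the symbol exported by a single library is filtered out by `uniq -d`.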
- GPU Model: _______ (MI300X, MI355X, MI325X)
- GFX IP: _______ (gfx942, gfx950)
- Nodes: ___, GPUs per node: ___, Total: ___
- OS: _______ (Ubuntu 22.04, RHEL 9)
- Framework: _______ version _______ (if applicable)
Provide ONE of:

Docker (preferred):
```bash
docker pull <registry>/<image>:<tag>
docker run -it --device=/dev/kfd --device=/dev/dri \
    --group-add video --cap-add=SYS_PTRACE <image> /bin/bash
```
Bare Metal:
- Machine: _______, Reservation: _______, Valid until: _______
Last Known Good:
- Build/Commit: _______
- Date: _______

First Known Bad:
- Build/Commit: _______
- Date: _______
This is THE most valuable information for debugging
Run BOTH versions with debug flags and compare:
```bash
# Last Good
git checkout <good-commit>
./install.sh -idc
export TENSILE_DB=0x8040
export HIPBLASLT_LOG_LEVEL=4
./run_test.sh > good_run.log 2>&1
# First Bad
git checkout <bad-commit>
./install.sh -idc
export TENSILE_DB=0x8040
export HIPBLASLT_LOG_LEVEL=4
./run_test.sh > bad_run.log 2>&1
# Compare kernel selections
diff <(grep -E "hipblasLt|Tensile" good_run.log) \
     <(grep -E "hipblasLt|Tensile" bad_run.log) > diff.txt
```
Attach to issue:
- good_run.log
- bad_run.log
- diff.txt
Git Bisect (if building from source):
```bash
git bisect start
git bisect bad <bad-commit>
git bisect good <good-commit>
# Test each checkout
git bisect log > bisect.log    # Attach this
```
| Aspect | Last Good | First Bad |
|---|---|---|
| ROCm | _______ | _______ |
| hipBLASLt | _______ | _______ |
| Tensile | _______ | _______ |
| Framework | _______ | _______ |
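The checkout-and-test loop of `git bisect` can be automated with `git bisect run`, which only needs a script whose exit code reflects the bug. A self-contained toy follows: a throwaway repo in /tmp stands in for rocm-libraries, the "bug" lands at commit 4 of 5, and `check.sh` plays the role of `./run_test.sh`:

```bash
# Build a 5-commit toy repo where the "bug" is introduced at commit 4.
rm -rf /tmp/bisect-demo
git init -q /tmp/bisect-demo
cd /tmp/bisect-demo
git config user.email demo@example.com
git config user.name demo
for i in 1 2 3 4 5; do
  if [ "$i" -ge 4 ]; then echo broken > state; else echo ok > state; fi
  git add state
  git commit -qm "commit $i"
done

# Untracked test script: exits nonzero once the bug is present.
printf '#!/bin/sh\ngrep -q ok state\n' > check.sh
chmod +x check.sh

git bisect start HEAD HEAD~4      # bad = tip, good = first commit
git bisect run ./check.sh > /dev/null
git bisect log | tail -n 1        # names the first bad commit
```

For the real library, point `git bisect run` at a script that builds (`./install.sh -idc`) and then runs your failing test.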
Set these environment variables BEFORE running the test:
```bash
# MANDATORY: Kernel selection logging
export TENSILE_DB=0x8040
# MANDATORY: hipBLASLt verbose logging
export HIPBLASLT_LOG_LEVEL=4
export HIPBLASLT_LOG_MASK=0xFFFFFFFF
# Helpful additional flags:
export AMD_LOG_LEVEL=4           # HIP runtime
export HIP_VISIBLE_DEVICES=0     # Single GPU test
```
Run the test and capture output:
```bash
./run_test.sh > debug_run.log 2>&1
# OR
python3 train.py > debug_run.log 2>&1
```
Required: Attach debug_run.log to the issue.
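As an aside, the mandatory `TENSILE_DB=0x8040` is itself two logging bits OR'd together (0x8000 problem dimensions, 0x40 kernel selection, per the flag reference later in this document); flag values combine with bitwise OR, which the shell can verify directly:

```bash
# 0x8000 (problem dimensions) | 0x40 (kernel selection) = 0x8040
printf 'TENSILE_DB=0x%X\n' $(( 0x8000 | 0x40 ))
```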
Can you reproduce with hipblaslt-bench?
```bash
hipblaslt-bench -f gemm_strided_batched \
    -m <M> -n <N> -k <K> \
    --batch_count <batch> \
    -r <precision> \
    -i <iterations>
# Example:
hipblaslt-bench -f gemm_strided_batched \
    -m 4096 -n 4096 -k 4096 \
    --batch_count 8 -r f16_r -i 100
```
Result: [ ] Reproduces [ ] Does not reproduce [ ] Not tested

If it reproduces: attach the exact command to the issue.
```bash
# Enable core dumps
ulimit -c unlimited
# Run test (will create a core file on crash)
./run_test.sh
# Analyze with gdb
gdb /path/to/binary core.xxx
(gdb) bt full
(gdb) info threads
```
Attach to issue:
- Core dump: http://______
- GDB backtrace (paste in issue)
```bash
# Rebuild with Address Sanitizer
./install.sh -c --address-sanitizer
# Run test
export ASAN_OPTIONS=detect_leaks=0:log_path=asan.log
./run_test.sh
```
If ASAN detects issues: attach asan.log.
Only complete if failure is multi-node specific
| Config | Result |
|---|---|
| 1 node | [ ] Pass [ ] Fail |
| 2 nodes | [ ] Pass [ ] Fail |
| 4 nodes | [ ] Pass [ ] Fail |
| 16 nodes | [ ] Pass [ ] Fail |
Pattern: Fails only above ___ nodes
Before filing the issue, ensure you have:

Core Information:
- Error type and stack trace
- Reproducibility (deterministic, or flaky with a %)
- ROCm version: `cat /opt/rocm/.info/version`
- hipBLASLt version: `dpkg -l | grep hipblaslt`
- rocm-libraries commit (if source build): `git log -1 --format="%H"`

Debug Logs:
- Ran with `TENSILE_DB=0x8040` and `HIPBLASLT_LOG_LEVEL=4`
- Full execution log attached: http://______
- If regression: both good and bad logs attached

Environment:
- Docker image: `docker pull <image>`
- OR Bare metal: machine details

Regression Info (if available):
- Last good build/commit: ___
- First bad build/commit: ___
- Logs from both versions
- Kernel selection comparison
For Segfaults:
- Core dump available
- GDB backtrace attached
For Flaky Errors:
- Tested 20+ times
- Failure rate >10%
- Multiple failure logs
- Multiple success logs (for comparison)
For GEMM Issues:
- hipblaslt-bench command (if reproducible)
- Kernel selection logs
**Title:** [Component][Error Type][GPU] Brief description
**ERROR TYPE:** [Segfault/GPU Kernel/Assertion/Memory/Other]
**REPRODUCIBILITY:**
[Deterministic 100%] OR [Flaky ___% - increases with scale]
**OBSERVED:**
[Stack trace]
**EXPECTED:**
[What should happen]
**IMPACT:**
[Who/what is blocked]
**VERSIONS:**
- ROCm: ___ (cat /opt/rocm/.info/version)
- hipBLASLt: ___ (dpkg -l | grep hipblaslt)
- rocm-libraries commit: ___ (if source build)
- Docker image: ___ (if using Docker)
- GPU: ___ (MI300X/MI355X/etc), GFX: ___
- Nodes/GPUs: ___ nodes, ___ GPUs/node
- Framework: ___ version ___
**MULTIPLE INSTALLATIONS CHECK:**
find /opt -name "rocm*" -type d → ___
# Multiple found: Yes/No
# If yes, which used: ___
**REGRESSION INFO:**
- Last known good: ___ (date: ___)
- First known bad: ___ (date: ___)
- Suspect change: ___
- Bisection: [Done/Not done]
**DEBUG LOGS (with TENSILE_DB=0x8040):**
- Good case (if regression): http://___
- Failing case: http://___
- Kernel comparison: http://___
**GEMM REPRODUCTION:**
hipblaslt-bench -f gemm_strided_batched \
-m ___ -n ___ -k ___ \
--batch_count ___ -r ___ -i ___
# Result: [Reproduces/Does not/Not tested]
**CORE DUMP (if segfault):**
- Location: http://___
- GDB backtrace:
[Paste backtrace]
**Environment:**
Docker: `docker pull <image>`
OR Bare metal: Machine details
**FLAKY ERROR DATA (if applicable):**
| Config | Runs | Failures | Rate |
|--------|------|----------|------|
| 1 node | ___ | ___ | ___% |
| 4 nodes | ___ | ___ | ___% |
| 16 nodes | ___ | ___ | ___% |
**ADDITIONAL CONTEXT:**
[Workarounds attempted, hypotheses, etc.]

START: Runtime error occurred
├─ Stack trace has hipblaslt/tensile/rocblas?
│   ├─ YES → Continue
│   └─ NO → Not a Math library issue
│
├─ Can reproduce with hipblaslt-bench?
│   ├─ YES → Math team
│   └─ NO → Check next
│
├─ Error during kernel selection (check with TENSILE_DB=0x8040)?
│   ├─ YES → Math team
│   └─ NO → Check next
│
├─ Error in framework (JAX/PyTorch) code?
│   ├─ YES → Framework team (but attach hipBLASLt logs)
│   └─ NO → Math team (if it mentions our libraries)
│
└─ Still unclear?
    → File with Math team, include ALL debug info
═══════════════════════════════════════════════════════════════
              RUNTIME ERROR TRIAGE QUICK GUIDE
═══════════════════════════════════════════════════════════════
BEFORE FILING AN ISSUE:
✓ Capture full stack trace
✓ Get ROCm version: cat /opt/rocm/.info/version
✓ Get commit: git log -1 --format="%H" (if source build)
✓ Check for multiple ROCm installations

REGRESSION? (if you know the last good version):
✓ Run BOTH good and bad with TENSILE_DB=0x8040
✓ Compare kernel selections
✓ Attach logs from both

FLAKY ERROR? (if not 100% reproducible):
✓ Test 20+ times, document failure rate
✓ Test at different scales (nodes/GPUs)
✓ Capture 3+ failure logs, 3+ success logs

ALWAYS RUN WITH DEBUG FLAGS:
✓ export TENSILE_DB=0x8040
✓ export HIPBLASLT_LOG_LEVEL=4
✓ export HIPBLASLT_LOG_MASK=0xFFFFFFFF
✓ Attach full log

FOR SEGFAULTS:
✓ ulimit -c unlimited → capture core dump
✓ gdb <binary> core → bt full
✓ Optional: rebuild with --address-sanitizer

FOR GEMM ISSUES:
✓ Try hipblaslt-bench reproduction
✓ Include exact bench command if it reproduces

MACHINE ACCESS:
✓ Docker: docker pull <image>
✓ OR Bare metal: Machine/Reservation details
═══════════════════════════════════════════════════════════════
Note: This section is FAQ-worthy - consider posting to wiki/docs.
```bash
# MANDATORY for kernel issues:
export TENSILE_DB=0x8040     # Kernel selection logging
# Individual flags (can combine with |):
export TENSILE_DB=0x1        # Library selection
export TENSILE_DB=0x2        # Solution selection
export TENSILE_DB=0x4        # Kernel execution
export TENSILE_DB=0x8        # Memory operations
export TENSILE_DB=0x10       # Timing info
export TENSILE_DB=0x40       # Kernel selection (included in 0x8040)
export TENSILE_DB=0x8000     # Problem dimensions (included in 0x8040)
# Combine flags: 0x8040 = 0x8000 | 0x40
```
Logging level (0=off, 1=error, 2=warn, 3=info, 4=trace):
```bash
export HIPBLASLT_LOG_LEVEL=4
# Log mask - all subsystems:
export HIPBLASLT_LOG_MASK=0xFFFFFFFF
# Specific subsystems (can combine with |):
export HIPBLASLT_LOG_MASK=0x1    # API calls
export HIPBLASLT_LOG_MASK=0x2    # Kernel selection
export HIPBLASLT_LOG_MASK=0x4    # Performance data
```
HIP runtime logging (0-4, 4 = most verbose):
```bash
export AMD_LOG_LEVEL=4
# HSA debugging
export HSA_ENABLE_DEBUG=1
# Control GPU visibility
export HIP_VISIBLE_DEVICES=0,1,2,3   # Specific GPUs
export ROCR_VISIBLE_DEVICES=0        # Specific GPU for ROCr
# Synchronous execution (helps debug async errors)
export HIP_LAUNCH_BLOCKING=1
```
Common build commands:
```bash
./install.sh -idc                     # Install + Dependencies + Clients
./install.sh -c --debug               # Clients + Debug build
./install.sh -c --address-sanitizer   # Clients + ASAN build
./install.sh -i                       # Install only
./install.sh -d                       # Dependencies
./install.sh -c                       # Build clients (benchmarks/tests)
./install.sh --debug                  # Debug build
./install.sh --address-sanitizer      # ASAN build
```
**Title:** [hipBLASLt][Runtime][MI123] Random segfault when running JAX - hipblasLtMatmulAlgoGetHeuristic
**ERROR TYPE:** Segfault
**REPRODUCIBILITY:** Flaky, 4% single-node to 50% at 4 nodes
**OBSERVED:**
```
Fatal Python error: Segmentation fault
Thread 0x00007f8a4c7fa700:
  hipblasLtMatmulAlgoGetHeuristic
  TensileHost::solutionSelectionLibrary
Segmentation fault (core dumped)
```
Occurs randomly during JAX XLA compilation, before training starts.
**EXPECTED:** JAX compilation completes, training begins
**VERSIONS:**
- ROCm: A.B.C
- hipBLASLt: 1.2.3.be40066
- Docker: my:image
- GPU: MI123 (gfx456), 1-64 nodes, 8 GPUs/node
- Framework: JAX G.H.I
**MULTIPLE INSTALLATIONS:** No conflicts found
**REGRESSION:**
- Last good: JAX D.E.F
- First bad: JAX G.H.I
- Suspect: JAX XLA GEMM compilation changes
**DEBUG LOGS (TENSILE_DB=0x8040):**
- Success (1 node): http://example.com/good.log
- Failure (4 nodes): http://example.com/bad.log
- Comparison: http://example.com/diff.txt
**KERNEL DIFFERENCES:**
```diff
< Selected solution 42 for GEMM(4096,4096,4096)
> Selected solution 157 for GEMM(4096,4096,4096)   ← SEGFAULT
```
**GEMM BENCH:** Does NOT reproduce with hipblaslt-bench
**CORE DUMP:** http://example.com/core.12345
```
#0 hipblasLtMatmulAlgoGetHeuristic ()
#1 TensileHost::solutionSelectionLibrary ()
#2 xla::gpu::GemmThunk::ExecuteOnStream ()
```
**MACHINE ACCESS:** Docker: `docker pull my:image` (public)
**FLAKY DATA:**
| Config | Runs | Failures | Rate |
|---|---|---|---|
| 1 node | 50 | 2 | 4% |
| 4 nodes | 20 | 10 | 50% |
**CONTEXT:** Appears to be a race condition in kernel selection when JAX compilation calls hipblasLtMatmulAlgoGetHeuristic from multiple threads.
Hypothesis: the solution selection database is not thread-safe.
Suggested: check TensileHost thread-safety; add a mutex if needed.
---
## Appendix C: Sanitizer Build Reference
### Address Sanitizer (ASAN)
```bash
./install.sh -c --address-sanitizer
export ASAN_OPTIONS="detect_leaks=0:log_path=asan.log"
./run_test.sh
```

- ❌ DON'T: File an issue without running with `TENSILE_DB=0x8040`
  - ✅ DO: Always enable debug logging first
- ❌ DON'T: File a flaky issue with <20 tests or a <10% failure rate
  - ✅ DO: Establish a solid reproduction rate
- ❌ DON'T: Skip regression analysis if you know the last good version
  - ✅ DO: Compare good vs bad FIRST - it is the most valuable debug info
- ❌ DON'T: Provide only partial logs or screenshots
  - ✅ DO: Attach complete logs from start to error
- ❌ DON'T: Forget to check for multiple ROCm installations
  - ✅ DO: Always run `find /opt -name "rocm*"`
- ❌ DON'T: File without environment information
  - ✅ DO: Provide a Docker image or machine details
- ❌ DON'T: Mix multiple unrelated errors in one issue
  - ✅ DO: One issue per problem