Problem
CUBLAS_GEMM_DEFAULT_TENSOR_OP (algorithm 99) produces ALL-NaN output for the transposed backward GEMMs (Trans/NoTrans and NoTrans/Trans) once input gradient magnitudes reach ~1e5. This occurs around block 18 of a 24-layer backward pass, where gradient magnitudes grow from ~1e-5 to ~1e5.
Forward GEMMs (NoTrans/NoTrans) are unaffected.
Five Whys
- Why NaN weights? → optimizer reads NaN gradients
- Why NaN gradients? → cuBLAS backward_a/b output ALL NaN
- Why NaN output from valid inputs? → tensor core GEMM algorithm
- Why only backward? → backward uses Trans flag, forward doesn't
- Why only after ~5 blocks? → gradient magnification reaches ~1e5
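The first two "whys" can be checked mechanically by scanning the backward pass for the first gradient buffer that goes NaN, and by logging per-block magnitudes to confirm the growth toward ~1e5. A minimal, framework-agnostic sketch in NumPy (the helper name and list-of-gradients representation are illustrative, not from this project's code):

```python
import numpy as np

def first_nan_block(grads):
    """Return (index, max_abs) of the first per-block gradient containing NaN,
    or (None, max_abs_overall) if all gradients are finite.
    `grads` is a list of arrays in backward-pass order."""
    peak = 0.0
    for i, g in enumerate(grads):
        if np.isnan(g).any():
            return i, peak
        peak = max(peak, float(np.abs(g).max()))
    return None, peak

# Example: magnitudes growing by 10x per block, with block 6 poisoned by NaN.
grads = [np.full((2, 2), 10.0 ** (k - 5)) for k in range(10)]
grads[6][0, 0] = np.nan
print(first_nan_block(grads))  # -> (6, 1.0), i.e. NaN first appears at block 6
```

Running this after each optimizer step localizes the failure to a specific block before any weights are corrupted, which is what narrows the blame from "NaN weights" down to a single GEMM.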
Fix
Switch from tensor core math to SIMD math:
- `CUBLAS_TF32_TENSOR_OP_MATH` → `CUBLAS_DEFAULT_MATH`
- `CUBLAS_COMPUTE_32F_FAST_TF32` → `CUBLAS_COMPUTE_32F`
- `CUBLAS_GEMM_DEFAULT_TENSOR_OP` → `CUBLAS_GEMM_DEFAULT`
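In code, the substitution lands in the math-mode call and the last two arguments of `cublasGemmEx`. A hedged sketch of one of the affected Trans/NoTrans backward GEMMs; the function name, variable names, and handle/buffer setup are illustrative, not this project's actual code:

```cpp
#include <cublas_v2.h>

// Illustrative backward GEMM: C (m x n) = op(A) * B with op(A) = A^T,
// A stored k x m, B stored k x n, all FP32, column-major.
void backward_gemm(cublasHandle_t handle,
                   const float *dA, const float *dB, float *dC,
                   int m, int n, int k) {
    const float alpha = 1.0f, beta = 0.0f;

    // Before: cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);
    cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);

    cublasGemmEx(handle,
                 CUBLAS_OP_T, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 dA, CUDA_R_32F, k,   // lda = k: A is k x m, transposed
                 dB, CUDA_R_32F, k,   // ldb = k: B is k x n
                 &beta,
                 dC, CUDA_R_32F, m,   // ldc = m
                 CUBLAS_COMPUTE_32F,      // was CUBLAS_COMPUTE_32F_FAST_TF32
                 CUBLAS_GEMM_DEFAULT);   // was CUBLAS_GEMM_DEFAULT_TENSOR_OP
}
```

Changing only one of the three enums is not enough: the math mode, compute type, and algorithm selector each independently permit tensor-core dispatch, so all three must move off the tensor-op variants.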
Performance
cuBLAS SIMD is still 6-14x faster than hand-written PTX:
- PTX baseline: 890 tok/s, 2.6% MFU
- cuBLAS SIMD: 5,216 tok/s, 15.1% MFU
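As a sanity check on those figures, the speedup of the cuBLAS SIMD path over the PTX baseline comes out nearly identical whether computed from throughput or from MFU:

```python
# Speedup of cuBLAS SIMD over the hand-written PTX baseline,
# from the throughput and MFU numbers quoted above.
tok_speedup = 5216 / 890      # tokens/sec ratio
mfu_speedup = 15.1 / 2.6      # MFU ratio
print(round(tok_speedup, 2), round(mfu_speedup, 2))  # -> 5.86 5.81
```

That ~5.9x matches the low end of the quoted 6-14x range; the higher figure presumably comes from other shapes or workloads not listed here.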