Full documentation for hipBLASLt is available at rocm.docs.amd.com/projects/hipBLASLt.
- Stream-K GEMM support has been enabled for the FP32, FP16, BF16, FP8, and BF8 data types on the MI300A APU. To activate this feature, set the `TENSILE_SOLUTION_SELECTION_METHOD` environment variable to `2`, for example, `export TENSILE_SOLUTION_SELECTION_METHOD=2`.
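As a minimal shell sketch of the activation step described above (only the environment variable and its value are from the release note; what you run afterwards is up to you):

```shell
# Select Stream-K kernels via Tensile's solution-selection method 2,
# as described in the release note above.
export TENSILE_SOLUTION_SELECTION_METHOD=2

# Any subsequent hipBLASLt GEMM run in this shell (an application or
# hipblaslt-bench) will now use Stream-K selection on MI300A.
echo "TENSILE_SOLUTION_SELECTION_METHOD=$TENSILE_SOLUTION_SELECTION_METHOD"
```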
- `HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT` and `HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT` are deprecated. Use the `ROCBLASLT_MATMUL_DESC_A_SCALE_MODE` and `ROCBLASLT_MATMUL_DESC_B_SCALE_MODE` attributes to set scalar (`HIPBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F`) or vector (`HIPBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F`) scaling.
- The non-V2 APIs (`GemmPreference`, `GemmProblemType`, `GemmEpilogue`, `GemmTuning`, `GemmInputs`) in the C++ header are now the same as the V2 APIs (`GemmPreferenceV2`, `GemmProblemTypeV2`, `GemmEpilogueV2`, `GemmTuningV2`, `GemmInputsV2`). The original non-V2 APIs are removed.
- `HIPBLASLT_MATMUL_DESC_A_SCALE_POINTER_VEC_EXT` and `HIPBLASLT_MATMUL_DESC_B_SCALE_POINTER_VEC_EXT` are deprecated.
- The `hipblasltExtAMaxWithScale` API is deprecated.
- The V2 APIs (`GemmPreferenceV2`, `GemmProblemTypeV2`, `GemmEpilogueV2`, `GemmTuningV2`, `GemmInputsV2`) are deprecated.
- Support roctx if `HIPBLASLT_ENABLE_MARKER=1` is set
- Output profile logging if `HIPBLASLT_LOG_MASK=64` is set
- Support the FP16 compute type
- Add memory bandwidth information in hipblaslt-bench output
- Support user offline tuning mechanism
- Add more samples
- Output the bench command along with the solution index if `HIPBLASLT_LOG_MASK=32` is set
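A hedged sketch combining the switches above: `HIPBLASLT_ENABLE_MARKER` and the `HIPBLASLT_LOG_MASK` values come from these notes, but treating the mask values as OR-able bits is an assumption based on the "MASK" naming.

```shell
# Enable roctx marker ranges (from the note above).
export HIPBLASLT_ENABLE_MARKER=1

# Assumption: mask bits combine bitwise, so 64 (profile logging)
# and 32 (bench command with solution index) can be OR-ed together.
export HIPBLASLT_LOG_MASK=$((64 | 32))

echo "HIPBLASLT_LOG_MASK=$HIPBLASLT_LOG_MASK"   # 96
```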
- Improve the overall performance of the XF32/FP16/BF16/FP8/BF8 data types
- Reduce library size
- Fix a multi-thread bug
- Fix a multi-stream bug
- Support the V2 C++ extension API for backward compatibility
- Support for the Int8 in, Int8 out data type
- Support for the FP32/FP64 data types on gfx110x
- Add the extension API `hipblaslt_ext::matmulIsTuned`
- Output atol and rtol for hipblaslt-bench validation
- Output the bench command for the hipBLASLt C++ extension API path if `HIPBLASLT_LOG_MASK=32` is set
- Support odd sizes for FP8/BF8 GEMM
- Reorganize and add more sample code
- Add a dependency on the hipblas-common package and remove the dependency on the hipblas package
- Support a fused kernel for `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for the FP8/BF8 data types
- Improve the library loading time
- Improve the overall performance of the first returned solution
- The V1 CPP extension API will be deprecated in a future release of hipBLASLt
- Extension API: `hipblasltExtAMaxWithScale`
- `GemmTuning` extension parameter to let the user set wgm
- Support `HIPBLASLT_MATMUL_DESC_AMAX_D_POINTER` for the FP8/BF8 data types
- Support for FP8/BF8 input, FP32/FP16/BF16/F8/BF8 output (only for the gfx94x architectures)
- Support `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_A_EXT` and `HIPBLASLT_MATMUL_DESC_COMPUTE_INPUT_TYPE_B_EXT` for the FP16 input data type to use FP8/BF8 MFMA instructions
- Support for the gfx110x architecture
- Improve the library loading time
- Extension APIs: `hipblasltExtSoftmax`, `hipblasltExtLayerNorm`, `hipblasltExtAMax`
- `GemmTuning` extension parameter to let the user set split-k
- Support for the mixed-precision data type: FP16/FP8 in with FP16 out
- Add CMake support for documentation
- The `algoGetHeuristic()` extension API for GroupGemm will be deprecated in a future release of hipBLASLt
- New `UserArguments` variable for `GroupedGemm`
- Support for the FP16 in, FP32 out data type
- Support for the Int8 in, Int32 out data type
- Support for gfx94x platform
- Support for FP8/BF8 datatype (only for gfx94x platform)
- Support scalar A, B, C, D for the FP8/BF8 data types
- Added samples
- Replaced `hipblasDatatype_t` with `hipDataType`
- Replaced `hipblasLtComputeType_t` with `hipblasComputeType_t`
- Deprecated `HIPBLASLT_MATMUL_DESC_D_SCALE_VECTOR_POINTER`
- Added `getAllAlgos` extension APIs
- TensileLite support for new epilogues: gradient gelu, gradient D, gradient A/B, aux
- Added a sample package that includes three sample apps
- Added a new C++ GEMM class in the hipBLASLt extension
- Refactored GroupGemm APIs as C++ class in the hipBLASLt extension
- Changed the scaleD vector enum to `HIPBLASLT_MATMUL_DESC_D_SCALE_VECTOR_POINTER`
- Enabled norm check validation for CI
- GSU kernel: wider memory, PGR N
- Updated logic yaml to improve some FP16 NN sizes
- GroupGemm support for GSU kernel
- Added grouped GEMM tuning for aldebaran
- Added CI tests for TensileLite
- Initialized extension group GEMM APIs (FP16 only)
- Added a group GEMM sample app: `example_hipblaslt_groupedgemm`
- Fixed incorrect results for the ScaleD kernel
- Tuned equality sizes for the HHS data type
- Reduced host-side overhead for `hipblasLtMatmul()`
- Removed unused kernel arguments
- Scheduled value setup before the first `s_waitcnt`
- Refactored TensileLite host code
- Optimized build time
- Enabled hipBLASLt APIs
- Support for gfx90a
- Support for problem type: FP32, FP16, BF16
- Support activation: relu, gelu
- Support for bias vectors
- Integrated with TensileLite kernel generator
- Added GTest: `hipblaslt-test`
- Added the full-function tool `hipblaslt-bench`
- Added the sample app `example_hipblaslt_preference`
- gridBase solution search algorithm for untuned sizes
- Tuned 10k sizes for each problem type