WIP: TF32 Performance and Refactor #2274
Open
carsonbrownlee wants to merge 20 commits into ROCm:develop from
Conversation
- tf32 support for different vector widths and depths
- tf32 lds debug
- adding attribute check to issueLatency calls
- tf32 fix lds
- tf32 fix for hipblaslt build
- tf32 lds fix for broken wave dims
- support for tf32 16x16x32 mfma
- tf32 16x16x32 debug
- adding absolute and relative error output to hipblaslt-bench client
- tf32 16x16x32 debug: fixing broken fp32 runs from instOffset
- tf32 MI 16x16x32 working with 256b reads
- fixing broken tf32 16x16x16 kernels
- tf32 emulation optimization
- tf32 fix for lds
- tf32 performance optimizations; k32 MI working with 256 tile sizes
- tf32 yaml for high compute intensity
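For context on the emulation work above: TF32 keeps fp32's 8-bit exponent but only a 10-bit mantissa, so emulating a TF32 MFMA on fp32 inputs starts by rounding each value to TF32 precision. A minimal sketch of that rounding, assuming round-to-nearest-even on the 13 dropped mantissa bits (the function name is illustrative, not taken from this PR):

```python
import struct

def round_to_tf32(x: float) -> float:
    """Round an fp32 value to TF32 precision (8-bit exponent,
    10-bit mantissa) with round-to-nearest-even, returning it
    as an ordinary fp32-representable float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round-to-nearest-even over the 13 mantissa bits being dropped:
    # add half the dropped range, biased by the surviving LSB.
    lsb = (bits >> 13) & 1
    bits = (bits + 0x0FFF + lsb) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

For example, at magnitude 1.0 the TF32 step is 2^-10, so 1 + 2^-12 rounds back down to 1.0 while 1 + 3·2^-12 rounds up to 1 + 2^-10.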
TF32 origami libs with 16x16x32 and optional lsu
Temporarily disable warning message on missing instruction latency for tf32 test build
ammallya pushed a commit that referenced this pull request on Nov 21, 2025
## Motivation

Recently a customer requested support for the sigmoid activation function in hipblaslt.

## Technical Details

Tensilelite already supports sigmoid and has the GPU code module named "Sigmoid" implemented in Activation.py. To enable this feature in hipblaslt, we had to update the enum types for hipblaslt_activation_type and the epilogue types in the hipblaslt and rocblaslt abstractions. We also added the gflops count in flops.hpp and updated the utility functions.

## Test Plan

New smoke tests and Matmul tests were added, and existing tests were extended to cover the sigmoid activation function. Unit tests were added for the newly added enum values for the sigmoid activation_type and epilogue.

## Test Result

All of the above-mentioned tests passed when running hipblaslt-test.

Co-authored-by: Madhusoodhanan Prabha <amadhuso@ctr2-alola-login-01.amd.com>
Co-authored-by: Madhusoodhanan Prabha <amadhuso@ctr2-alola-ctrl-01.amd.com>
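As a reference for what the new epilogue computes, here is a minimal NumPy sketch of a GEMM followed by a sigmoid activation. The function name is illustrative only; the actual hipblaslt path goes through its matmul API with the new epilogue enum:

```python
import numpy as np

def gemm_with_sigmoid_epilogue(a, b, c=None, alpha=1.0, beta=0.0):
    """Reference model: D = sigmoid(alpha * (A @ B) + beta * C).
    A hedged sketch of the epilogue's math, not hipblaslt's code."""
    acc = alpha * (a @ b)
    if c is not None:
        acc = acc + beta * c
    # Sigmoid applied elementwise as the epilogue step.
    return 1.0 / (1.0 + np.exp(-acc))
```

With zero inputs every accumulator entry is 0, so the epilogue maps the whole output to sigmoid(0) = 0.5.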
refactor: tf32 codegen into F32XEmulation module
add: tf32 support for different vector widths and depths
add: lds support for both inputs in TF32 MFMA
fix: adding attribute check to issueLatency and other calls that assume type
add: support for tf32 16x16x32 mfma and 256x256 tile sizes
add: absolute and relative error output in the hipblaslt-bench client
add: tuned tf32 yamls for high compute intensity
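The absolute/relative error output listed above can be modeled as follows. This is a sketch of the metrics, not the bench client's actual code; the `eps` floor guarding division by zero is an assumption:

```python
import numpy as np

def abs_rel_error(result, reference, eps=1e-12):
    """Max absolute and max relative error between a GPU result
    and a higher-precision reference, as a bench client might report."""
    diff = np.abs(np.asarray(result, dtype=np.float64)
                  - np.asarray(reference, dtype=np.float64))
    abs_err = float(diff.max())
    # eps floor avoids dividing by zero where the reference is 0.
    rel_err = float((diff / np.maximum(np.abs(reference), eps)).max())
    return abs_err, rel_err
```

Relative error is the more useful signal for TF32 emulation, since the reduced 10-bit mantissa bounds the per-element relative rounding error regardless of magnitude.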