@yudongsi
Contributor

@yudongsi yudongsi commented Oct 31, 2024

Local results (Triton, mean TFlops):
Default path: ~6.16 to 6.44. Advanced path: ~6.19 to 6.79.

Before:

Default:

matmul-performance:
        B    M      K        N  Triton-GB/s   XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0  4096.0  8.0  128.0  16384.0   866.781315  1013.591803        860.94579     1009.020643       869.550433     1016.113302       6.161104      7.204637           6.119625          7.172145           6.180787           7.22256    0.00239  0.001517



Advanced:
matmul-performance:
        B    M      K        N  Triton-GB/s   XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0  4096.0  8.0  128.0  16384.0   870.200789  1013.800005       862.488109     1010.581648       875.605071     1018.837387        6.18541      7.206117           6.130588          7.183241           6.223824          7.241923   0.004391  0.002175



After:

Default:
matmul-performance:
        B    M      K        N  Triton-GB/s  XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0  4096.0  8.0  128.0  16384.0   906.811524  1012.40191        900.05344     1006.608438       910.558331      1019.53374        6.44564      7.196179           6.397603          7.154999           6.472272          7.246872   0.002137  0.001734


Advanced:
matmul-performance:
        B    M      K        N  Triton-GB/s   XeTLA-GB/s  Triton-GB/s-min  XeTLA-GB/s-min  Triton-GB/s-max  XeTLA-GB/s-max  Triton-TFlops  XeTLA-TFlops  Triton-TFlops-min  XeTLA-TFlops-min  Triton-TFlops-max  XeTLA-TFlops-max  Triton-CV  XeTLA-CV
0  4096.0  8.0  128.0  16384.0   954.660979  1012.176091       952.807487     1009.501134       959.487262     1015.609403       6.785755      7.194574            6.77258           7.17556            6.82006          7.218978   0.002269  0.001743
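
The GB/s and TFlops columns above are two views of the same kernel time, so one can be derived from the other once the shape is fixed. The sketch below shows that relationship for this shape. The element sizes (2-byte inputs, 4-byte output) are an assumption inferred from the reported numbers, not something stated in this PR, and the helper names are hypothetical.

```python
# Benchmark shape from the tables: batch count, then (M x K) @ (K x N) per batch.
BATCH, M, K, N = 4096, 8, 128, 16384

# Assumption (not stated in the PR): 2-byte inputs such as bf16/fp16
# and a 4-byte fp32 output. These sizes reproduce the reported GB/s.
IN_BYTES, OUT_BYTES = 2, 4


def total_flops(batch, m, k, n):
    # One multiply and one add per term of each inner product.
    return 2 * batch * m * k * n


def total_bytes(batch, m, k, n):
    # Each input matrix is read once and the output is written once.
    return batch * (IN_BYTES * (m * k + k * n) + OUT_BYTES * m * n)


def gbps_from_tflops(tflops):
    # Recover the kernel time implied by the TFlops figure,
    # then express the memory traffic over that time in GB/s.
    seconds = total_flops(BATCH, M, K, N) / (tflops * 1e12)
    return total_bytes(BATCH, M, K, N) / seconds / 1e9


if __name__ == "__main__":
    for label, tflops in [("before, default", 6.161104),
                          ("after, advanced", 6.785755)]:
        print(f"{label}: {gbps_from_tflops(tflops):.2f} GB/s")
```

Under these assumptions the derived bandwidth agrees with the Triton-GB/s column to within rounding, which suggests the two metrics move together for this shape and either one can be used to compare runs.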

@etiotto etiotto merged commit 344bf2c into main Oct 31, 2024
5 checks passed
@etiotto etiotto deleted the yudong/2434 branch October 31, 2024 13:20
@whitneywhtsang
Contributor

Does XeTLA use the new block sizes? If not, is there another reason why XeTLA is faster?

@yudongsi
Contributor Author

yudongsi commented Nov 4, 2024

> Does XeTLA use the new block sizes? If not, is there another reason why XeTLA is faster?

No, XeTLA doesn't use the new block sizes. Triton is now at ~90% of XeTLA, but CI does not reproduce the ~6.8 TFlops I measured locally on the advanced path, so this still needs investigation.

Reopened the related issue.
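
The local gains and the gap to XeTLA discussed here can be checked directly against the mean TFlops in the tables from the PR description. The snippet below is only a sketch of that arithmetic. The dictionary layout and function names are hypothetical, and the numbers are the local measurements, not the CI results mentioned above.

```python
# Mean TFlops copied from the tables in the PR description: (Triton, XeTLA).
RESULTS = {
    ("default", "before"): (6.161104, 7.204637),
    ("advanced", "before"): (6.18541, 7.206117),
    ("default", "after"): (6.44564, 7.196179),
    ("advanced", "after"): (6.785755, 7.194574),
}


def triton_speedup(path):
    # Ratio of Triton throughput after the change to before it.
    before, _ = RESULTS[(path, "before")]
    after, _ = RESULTS[(path, "after")]
    return after / before


def fraction_of_xetla(path, when):
    # Triton throughput as a fraction of XeTLA throughput in the same run.
    triton, xetla = RESULTS[(path, when)]
    return triton / xetla


if __name__ == "__main__":
    for path in ("default", "advanced"):
        print(f"{path}: speedup {triton_speedup(path):.3f}x, "
              f"{fraction_of_xetla(path, 'before'):.1%} -> "
              f"{fraction_of_xetla(path, 'after'):.1%} of XeTLA")
```

On the local numbers this gives a gain of roughly 4.6% on the default path and 9.7% on the advanced path, with Triton reaching about 90% of XeTLA on the default path and about 94% on the advanced path.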


Development

Successfully merging this pull request may close these issues.

[GEMM] Improve performance of shape 4096x8x128x16384
