[355_wip] Let dynamo capture silu+mul+f4gemm pattern #701

xytpai · 2025-09-23T17:16:31Z

This PR registers new pattern-matching logic to integrate aiter::act_mul_and_mxfp4_quant.
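For readers unfamiliar with the pattern named in the title, here is a minimal reference sketch of the unfused silu+mul step that such a fusion would replace. It assumes the common silu-and-mul convention, where the first half of the last dimension is the gate and the second half is the multiplier. The function names are illustrative only; the actual signature of aiter::act_mul_and_mxfp4_quant and the MXFP4 quantization step are not shown in this thread and are not modeled here.

```python
import math


def silu(v: float) -> float:
    # SiLU (swish) activation: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def silu_and_mul(row: list[float]) -> list[float]:
    # Assumption: the first half of the row is the gate, the second half
    # is the value multiplied by the activated gate.
    d = len(row) // 2
    gate, up = row[:d], row[d:]
    return [silu(g) * u for g, u in zip(gate, up)]


if __name__ == "__main__":
    print(silu_and_mul([0.0, 1.0, 2.0, 3.0]))
```

In the unfused path, this output would then be quantized and passed to the FP4 GEMM as separate steps; the fusion targets that sequence.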

xytpai · 2025-09-24T08:06:15Z

BASELINE:

local-completions (model=/data/Llama-3.3-70B-Instruct-MXFP4-Preview,base_url=http://localhost:9771/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.932|±  |0.0160|
|     |       |strict-match    |     5|exact_match|↑  |0.056|±  |0.0146|

============ Serving Benchmark Result ============
Successful requests:                     256       
Maximum request concurrency:             64        
Request rate configured (RPS):           2.00      
Benchmark duration (s):                  290.21    
Total input tokens:                      261888    
Total generated tokens:                  262144    
Request throughput (req/s):              0.88      
Output token throughput (tok/s):         903.28    
Total Token throughput (tok/s):          1805.68   
---------------Time to First Token----------------
Mean TTFT (ms):                          2107.39   
Median TTFT (ms):                        1916.77   
P99 TTFT (ms):                           4252.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.00     
Median TPOT (ms):                        67.95     
P99 TPOT (ms):                           69.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.00     
Median ITL (ms):                         42.93     
P99 ITL (ms):                            988.55    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          68598.03  
Median E2EL (ms):                        72068.74  
P99 E2EL (ms):                           72315.47  
==================================================

NEW:

local-completions (model=/data/Llama-3.3-70B-Instruct-MXFP4-Preview,base_url=http://localhost:9771/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.952|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.052|±  |0.0141|

============ Serving Benchmark Result ============
Successful requests:                     256       
Maximum request concurrency:             64        
Request rate configured (RPS):           2.00      
Benchmark duration (s):                  282.66    
Total input tokens:                      261888    
Total generated tokens:                  262144    
Request throughput (req/s):              0.91      
Output token throughput (tok/s):         927.42    
Total Token throughput (tok/s):          1853.93   
---------------Time to First Token----------------
Mean TTFT (ms):                          2229.05   
Median TTFT (ms):                        2081.75   
P99 TTFT (ms):                           4068.82   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.08     
Median TPOT (ms):                        65.72     
P99 TPOT (ms):                           68.25     
---------------Inter-token Latency----------------
Mean ITL (ms):                           63.08     
Median ITL (ms):                         42.06     
P99 ITL (ms):                            937.36    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          66760.35  
Median E2EL (ms):                        69354.50  
P99 E2EL (ms):                           70977.04  
==================================================
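To make the BASELINE and NEW runs easier to compare, the following sketch computes the relative change for a few of the reported metrics. The numbers are copied from the two tables above; the dictionary layout and the helper name `pct_change` are illustrative. Each configuration was reported as a single run, so small differences may be within run-to-run noise.

```python
# Values copied from the BASELINE and NEW serving benchmark results above.
baseline = {
    "Benchmark duration (s)": 290.21,
    "Output token throughput (tok/s)": 903.28,
    "Mean TTFT (ms)": 2107.39,
    "Mean TPOT (ms)": 65.00,
    "Mean E2EL (ms)": 68598.03,
}
new = {
    "Benchmark duration (s)": 282.66,
    "Output token throughput (tok/s)": 927.42,
    "Mean TTFT (ms)": 2229.05,
    "Mean TPOT (ms)": 63.08,
    "Mean E2EL (ms)": 66760.35,
}


def pct_change(old: float, new_value: float) -> float:
    # Relative change in percent; positive means the NEW value is larger.
    return (new_value - old) / old * 100.0


changes = {name: pct_change(baseline[name], new[name]) for name in baseline}

if __name__ == "__main__":
    for name, delta in changes.items():
        print(f"{name:34s} {delta:+.2f}%")
```

With these inputs, output token throughput rises by roughly 2.7% and mean TPOT drops by roughly 3%, while mean TTFT is about 5.8% higher in the NEW run.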

xytpai · 2025-09-24T08:17:34Z

xytpai · 2025-09-25T06:50:16Z

To enable MXFP4 fusion, pass the following compilation config:

--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["+rms_norm", "+silu_and_mul"]}'
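If the flag is assembled programmatically, a small sketch like the one below can avoid shell-quoting mistakes in the JSON. The keys and values are taken from the flag above; the serving command that the flag is appended to is not shown in this thread, so only the flag itself is built here.

```python
import json
import shlex

# Same settings as in the flag above.
compilation_config = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "custom_ops": ["+rms_norm", "+silu_and_mul"],
}

# shlex.quote wraps the JSON in single quotes so the shell passes it as one argument.
flag = "--compilation-config " + shlex.quote(json.dumps(compilation_config))

if __name__ == "__main__":
    print(flag)
```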

xytpai · 2025-09-25T08:42:55Z

To reduce the number of open PRs, this PR has been merged into #705.

xytpai added 2 commits September 24, 2025 01:13

Update activation_quant_fusion.py

4ee35bb

Update quark_w4a4_mxfp4.py

22e36f5

xytpai changed the title ~~[355_wip] Let inductor capture silu+mul+f4gemm pattern~~ [355_wip] Let dynamo capture silu+mul+f4gemm pattern Sep 23, 2025

Refine example inputs

b41c882

xytpai requested review from dllehr-amd, wuhuikx and zejunchen-zejun September 24, 2025 08:25

xytpai closed this Sep 25, 2025
