xytpai commented Sep 23, 2025

We registered new pattern-matching logic to integrate aiter::act_mul_and_mxfp4_quant.
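For readers unfamiliar with the pattern being matched, here is a minimal pure-Python sketch of what silu+mul followed by MXFP4 quantization computes. It assumes the fused op applies SiLU to the first half of the last dimension, multiplies by the second half, and then quantizes in blocks that share a power-of-two scale. The function names, the E2M1 value table, the block size of 32, and the scale rule are assumptions based on the general MX format description, not taken from this PR; the real aiter kernel's signature, rounding, and packed output layout may differ.

```python
import math

# Magnitudes representable by FP4 E2M1, the MXFP4 element format (assumption).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 32  # number of elements sharing one power-of-two scale (assumption)


def silu(x: float) -> float:
    """Numerically stable SiLU: x * sigmoid(x)."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    ex = math.exp(x)
    return x * ex / (1.0 + ex)


def silu_and_mul(row: list[float]) -> list[float]:
    """Unfused reference: silu(first half) * second half."""
    d = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:d], row[d:])]


def quantize_block(block: list[float]) -> tuple[int, list[float]]:
    """Return (scale exponent, dequantized values) for one block."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared scale 2**e, chosen so amax lands in E2M1's top binade [4, 8).
    e = math.floor(math.log2(amax)) - 2
    scale = 2.0 ** e
    out = []
    for v in block:
        mag = min(abs(v) / scale, 6.0)  # clamp to the largest E2M1 magnitude
        # Nearest representable value; ties go to the smaller magnitude,
        # which is a simplification of real rounding modes.
        q = min(E2M1_VALUES, key=lambda c: abs(c - mag))
        out.append(math.copysign(q, v) * scale)
    return e, out


def act_mul_and_quant_reference(row: list[float]) -> list[tuple[int, list[float]]]:
    """Hypothetical unfused reference for the fused pattern."""
    act = silu_and_mul(row)
    return [quantize_block(act[i:i + BLOCK]) for i in range(0, len(act), BLOCK)]


if __name__ == "__main__":
    print(quantize_block([6.0, 1.0, -3.0, 0.4]))
```

Fusing these steps avoids writing the intermediate activation back to memory before quantization, which is the usual motivation for this kind of pattern match.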

xytpai changed the title from "[355_wip] Let inductor capture silu+mul+f4gemm pattern" to "[355_wip] Let dynamo capture silu+mul+f4gemm pattern" on Sep 23, 2025

xytpai commented Sep 24, 2025

BASELINE:

local-completions (model=/data/Llama-3.3-70B-Instruct-MXFP4-Preview,base_url=http://localhost:9771/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.932|±  |0.0160|
|     |       |strict-match    |     5|exact_match|↑  |0.056|±  |0.0146|
============ Serving Benchmark Result ============
Successful requests:                     256       
Maximum request concurrency:             64        
Request rate configured (RPS):           2.00      
Benchmark duration (s):                  290.21    
Total input tokens:                      261888    
Total generated tokens:                  262144    
Request throughput (req/s):              0.88      
Output token throughput (tok/s):         903.28    
Total Token throughput (tok/s):          1805.68   
---------------Time to First Token----------------
Mean TTFT (ms):                          2107.39   
Median TTFT (ms):                        1916.77   
P99 TTFT (ms):                           4252.91   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          65.00     
Median TPOT (ms):                        67.95     
P99 TPOT (ms):                           69.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           65.00     
Median ITL (ms):                         42.93     
P99 ITL (ms):                            988.55    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          68598.03  
Median E2EL (ms):                        72068.74  
P99 E2EL (ms):                           72315.47  
==================================================

NEW:

local-completions (model=/data/Llama-3.3-70B-Instruct-MXFP4-Preview,base_url=http://localhost:9771/v1/completions,num_concurrent=256,max_retries=10,max_gen_toks=2048), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.952|±  |0.0135|
|     |       |strict-match    |     5|exact_match|↑  |0.052|±  |0.0141|

============ Serving Benchmark Result ============
Successful requests:                     256       
Maximum request concurrency:             64        
Request rate configured (RPS):           2.00      
Benchmark duration (s):                  282.66    
Total input tokens:                      261888    
Total generated tokens:                  262144    
Request throughput (req/s):              0.91      
Output token throughput (tok/s):         927.42    
Total Token throughput (tok/s):          1853.93   
---------------Time to First Token----------------
Mean TTFT (ms):                          2229.05   
Median TTFT (ms):                        2081.75   
P99 TTFT (ms):                           4068.82   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          63.08     
Median TPOT (ms):                        65.72     
P99 TPOT (ms):                           68.25     
---------------Inter-token Latency----------------
Mean ITL (ms):                           63.08     
Median ITL (ms):                         42.06     
P99 ITL (ms):                            937.36    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          66760.35  
Median E2EL (ms):                        69354.50  
P99 E2EL (ms):                           70977.04  
==================================================
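To make the two runs easier to compare, the sketch below computes relative changes for a few of the metrics reported above. The numbers are copied from the BASELINE and NEW outputs; the helper name `pct_change` and the choice of metrics are illustrative only. Each configuration was run once here, so small differences should be read with caution.

```python
import math

# Values copied from the BASELINE and NEW benchmark outputs above.
baseline = {
    "Benchmark duration (s)": 290.21,
    "Output token throughput (tok/s)": 903.28,
    "Mean TTFT (ms)": 2107.39,
    "Mean TPOT (ms)": 65.00,
    "P99 ITL (ms)": 988.55,
    "Mean E2EL (ms)": 68598.03,
}
new = {
    "Benchmark duration (s)": 282.66,
    "Output token throughput (tok/s)": 927.42,
    "Mean TTFT (ms)": 2229.05,
    "Mean TPOT (ms)": 63.08,
    "P99 ITL (ms)": 937.36,
    "Mean E2EL (ms)": 66760.35,
}


def pct_change(old: float, cur: float) -> float:
    """Relative change from old to cur, in percent."""
    return (cur - old) / old * 100.0


for name in baseline:
    print(f"{name:<34} {pct_change(baseline[name], new[name]):+6.2f}%")

# gsm8k flexible-extract accuracy: compare the gap with the reported stderrs.
acc_delta = 0.952 - 0.932
combined_stderr = math.hypot(0.0160, 0.0135)  # assumes independent runs
print(f"accuracy delta {acc_delta:.3f} vs combined stderr {combined_stderr:.4f}")
```

On these numbers, output throughput rises by about 2.7% and mean TPOT falls by about 3.0%, while mean TTFT is about 5.8% higher. The gsm8k flexible-extract gap of 0.020 is within roughly one combined standard error, so it does not by itself indicate an accuracy change.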

xytpai commented Sep 24, 2025

[Image attachment]

xytpai commented Sep 25, 2025

To enable MXFP4 fusion, pass the following compilation config:

--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE", "custom_ops": ["+rms_norm", "+silu_and_mul"]}'
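If the flag is assembled programmatically, building the JSON from a dictionary avoids quoting mistakes. The following is a small sketch using only the Python standard library; the keys and values are taken from the flag above, while the variable names are illustrative and the launcher command that the flag is passed to is not shown.

```python
import json
import shlex

# Same settings as the flag above, expressed as a Python dict.
compilation_config = {
    "cudagraph_mode": "FULL_AND_PIECEWISE",
    "custom_ops": ["+rms_norm", "+silu_and_mul"],
}

# Serialize to JSON and build the argument pair for a subprocess call.
argv = ["--compilation-config", json.dumps(compilation_config)]

# shlex.join adds the shell quoting needed to paste the flag into a terminal.
print(shlex.join(argv))
```

When launching through `subprocess.run`, the `argv` list can be appended to the command as-is, with no extra shell quoting.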

xytpai commented Sep 25, 2025

To reduce the number of PRs, the changes in this PR have been merged into #705.

xytpai closed this on Sep 25, 2025