Add silu-and-mul and per-token dynamic FP8 quant fusion #852
Purpose
This PR fuses silu-and-mul with dynamic per-token FP8 quantization to speed up inference for PTPC-FP8 models.
To enable this fusion, add the following to the compilation config:
--compilation-config '{"pass_config": {"fuse_act_quant": "true"}}'
This fusion pass requires new aiter features that are not yet in upstream aiter, so the fusion is disabled by default. It is added here for model validation and to speed up upstream cherry-picking once the features are added to aiter.
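For reviewers, the math that the fused kernel is expected to reproduce can be sketched in plain Python. This is only an unfused reference under stated assumptions, not the aiter kernel or the pass added in this PR: the function names are hypothetical, the input row is assumed to be laid out as `[gate | up]`, `FP8_MAX` assumes the e4m3fn format (the fnuz variant has a different maximum), and the final rounding to an FP8 bit pattern is omitted.

```python
import math

# Assumption: largest finite value of float8 e4m3fn. The e4m3fnuz variant
# uses 240.0 instead, so pass fp8_max explicitly if that format is in use.
FP8_MAX = 448.0


def silu(x: float) -> float:
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def silu_and_mul(row):
    """Treat row as [gate | up] halves and return silu(gate) * up."""
    d = len(row) // 2
    return [silu(g) * u for g, u in zip(row[:d], row[d:])]


def per_token_quant(row, fp8_max=FP8_MAX, eps=1e-12):
    """Dynamic per-token quantization: one scale per row (token).

    Returns the scaled, clamped values and the scale. Rounding to an
    actual FP8 value is left out of this sketch.
    """
    scale = max(max(abs(v) for v in row) / fp8_max, eps)
    q = [min(max(v / scale, -fp8_max), fp8_max) for v in row]
    return q, scale


def silu_mul_per_token_quant(tokens, fp8_max=FP8_MAX):
    """Reference for the fused op: activation, then per-token quantization."""
    out, scales = [], []
    for row in tokens:
        q, s = per_token_quant(silu_and_mul(row), fp8_max)
        out.append(q)
        scales.append(s)
    return out, scales


if __name__ == "__main__":
    tokens = [[0.0, 1.0, 2.0, 3.0], [1.0, -1.0, 4.0, 4.0]]
    q, scales = silu_mul_per_token_quant(tokens)
    print(q, scales)
```

A unit test for the fusion pass could compare the fused output against a reference of this shape: multiplying each quantized row by its scale should recover the silu-and-mul activation up to FP8 rounding error.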
Test Plan
Unit test for the new fusion pass.
Model end-to-end test on RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic with silu-mul-ptpcfp8 fusion.
Test Result
End-to-end tests with RedhatAI/Qwen2.5-VL-72B-Instruct-FP8-dynamic, TP4 on MI300X
lm_eval with ChartQA
Throughput test
without fusion
with fusion
Essential Elements of an Effective PR Description Checklist
supported_models.md and examples for a new model.