add fp8 kv argument for llama3 example #2372
PR Type

Enhancement

Description

- Added `static_kv_dtype` argument for FP8 quantization
- Updated `run_benchmark.sh` to support the FP8 KV cache
- Modified `run_quant.sh` to include `static_kv_dtype`
- Updated the README with instructions for enabling the FP8 KV cache
File Walkthrough
quantize.py — Add `static_kv_dtype` argument
`examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/quantize.py`

- Added `static_kv_dtype` argument to the parser
- Passed `static_kv_dtype` to `load_recipe_results`
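As a minimal sketch only (the option name comes from the walkthrough, but the default, choices, and help text are assumptions, not taken from the PR diff), adding such a flag to an argparse parser might look like:

```python
# Hypothetical sketch of the new quantize.py flag; only the option name
# "--static_kv_dtype" is from the PR description, the rest is assumed.
import argparse

parser = argparse.ArgumentParser(description="llama3 quantization (sketch)")
parser.add_argument(
    "--static_kv_dtype",
    default=None,
    choices=["fp8"],
    help="If set to 'fp8', statically quantize the KV cache to FP8.",
)

args = parser.parse_args(["--static_kv_dtype", "fp8"])
print(args.static_kv_dtype)  # fp8
```

The parsed value would then be forwarded to the quantization recipe (per the walkthrough, via `load_recipe_results`).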
run_benchmark.sh — Update run_benchmark.sh for FP8 KV cache
`examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_benchmark.sh`

- Added `-kv` option to handle the KV cache dtype
- Updated the `lm_eval` command to include `kv_cache_dtype`
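A hedged sketch of the two changes above (the `-kv` spelling is from the walkthrough; the variable names and the `model_args` string are assumptions, not the script's actual contents):

```shell
#!/bin/bash
# Hypothetical sketch: parse a -kv flag, then thread it into an
# lm_eval-style model_args string as kv_cache_dtype.
parse_args() {
  kv_cache_dtype="auto"
  while [ $# -gt 0 ]; do
    case "$1" in
      -kv) kv_cache_dtype="$2"; shift 2 ;;
      *) shift ;;
    esac
  done
}

parse_args -kv fp8

# Assumed model_args layout; the real lm_eval invocation may differ.
model_args="pretrained=./saved_results"
if [ "$kv_cache_dtype" = "fp8" ]; then
  model_args="${model_args},kv_cache_dtype=fp8"
fi
echo "$model_args"
```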
run_quant.sh — Modify run_quant.sh for FP8 KV cache
`examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/run_quant.sh`

- Added `-kv` option to handle the KV cache dtype
- Added `static_kv_dtype` to `COMMON_ARGS`
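The `COMMON_ARGS` change might be sketched as follows (the variable name `COMMON_ARGS` and the flag `--static_kv_dtype` are from the walkthrough; the base arguments shown are placeholders, not the script's real contents):

```shell
#!/bin/bash
# Hypothetical sketch: append --static_kv_dtype to the COMMON_ARGS string
# that run_quant.sh passes on to quantize.py.
COMMON_ARGS="--model_name_placeholder --bits 4"  # placeholder base args
kv_cache_dtype="fp8"  # would come from the -kv option in practice

if [ "$kv_cache_dtype" = "fp8" ]; then
  COMMON_ARGS="${COMMON_ARGS} --static_kv_dtype fp8"
fi
echo "$COMMON_ARGS"
```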
README.md — Update README with FP8 KV cache instructions
`examples/pytorch/nlp/huggingface_models/language-modeling/quantization/auto_round/llama3/README.md`
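Based on the walkthrough above, the README's enabling instruction is presumably an invocation along these lines (hypothetical; the exact command and defaults are in the updated README, not reproduced here):

```
# hypothetical invocation; the -kv flag is described in the walkthrough above
bash run_quant.sh -kv fp8
```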