Labels: AutoDeploy (`<NV> AutoDeploy Backend`), Low Precision (lower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ))
Description
🚀 The feature, motivation and pitch
- Block-wise GEMM quantization
- MLA + FP8 KV cache

FP8 KV cache with MLA is supported on Hopper and Blackwell. The accuracy loss is small: the GSM8K accuracy drop is less than 1%. Note that Hopper and Blackwell use different kernel sources.
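For context on the first item: block-wise quantization splits each weight matrix into fixed-size tiles and stores one FP8 scale per tile, so outliers in one tile do not inflate the scale of the whole matrix. Below is a minimal NumPy sketch of the idea under stated assumptions: the 128x128 tile size is a common convention (not confirmed by this issue), the FP8 E4M3 max of 448.0 is used as the clipping range, and round-to-nearest on a uniform grid stands in for the real, non-uniform FP8 grid that the hardware kernels use.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def quantize_blockwise(w: np.ndarray, block: int = 128):
    """Per-tile symmetric quantization into a simulated FP8 range.

    Returns the quantized values and one float32 scale per (block x block)
    tile. Real kernels would store w_q in hardware FP8 and fold the scales
    into the GEMM epilogue; this sketch only models the per-tile scaling.
    """
    rows, cols = w.shape
    nbr, nbc = -(-rows // block), -(-cols // block)  # ceil-div tile counts
    scales = np.empty((nbr, nbc), dtype=np.float32)
    w_q = np.empty_like(w, dtype=np.float32)
    for i in range(nbr):
        for j in range(nbc):
            rs, cs = slice(i * block, (i + 1) * block), slice(j * block, (j + 1) * block)
            amax = np.abs(w[rs, cs]).max()
            scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
            scales[i, j] = scale
            # round-to-nearest into the FP8 dynamic range (simulated grid)
            w_q[rs, cs] = np.clip(np.round(w[rs, cs] / scale),
                                  -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return w_q, scales


def dequantize_blockwise(w_q: np.ndarray, scales: np.ndarray, block: int = 128):
    """Invert quantize_blockwise by rescaling each tile with its own scale."""
    w = np.empty_like(w_q)
    for i in range(scales.shape[0]):
        for j in range(scales.shape[1]):
            rs, cs = slice(i * block, (i + 1) * block), slice(j * block, (j + 1) * block)
            w[rs, cs] = w_q[rs, cs] * scales[i, j]
    return w


rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)
w_hat = dequantize_blockwise(*quantize_blockwise(w))
print(f"max abs reconstruction error: {np.abs(w - w_hat).max():.4f}")
```

Because each tile's rounding error is bounded by half its own scale, a tile with small magnitudes keeps a proportionally small error, which is the motivation for block-wise rather than per-tensor scales.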
Alternatives
No response
Additional context
No response
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
Status: Backlog