From 05f3aab123868ba124a8fb3b7078032d7ecdbe86 Mon Sep 17 00:00:00 2001
From: jaycoolslm <86686746+jaycoolslm@users.noreply.github.com>
Date: Sat, 6 Dec 2025 18:16:52 +0000
Subject: [PATCH 1/2] Clarify autoq_format note in README

Updated the note regarding autoq_format to fix grammar and clarify that
using int8_sq and fp8 together is not supported.

Signed-off-by: jaycoolslm <86686746+jaycoolslm@users.noreply.github.com>
---
 examples/quantization/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/quantization/README.md b/examples/quantization/README.md
index e74736b61b8..2ea4d82408f 100644
--- a/examples/quantization/README.md
+++ b/examples/quantization/README.md
@@ -67,7 +67,7 @@ Checkpoint saved in `output_dir` can be directly passed to `trtllm-build`.
 - int8_wo: Actually nothing is applied to weights. Weights are quantized to INT8 channel wise when TRTLLM building the engine.
 - int4_wo: Same as int8_wo but in INT4.
 - full_prec: No quantization.
-- autoq_format: Specific quantization algorithms are searched in auto quantization. The algorithm must in ['fp8', 'int4_awq', 'w4a8_awq', 'int8_sq'] and you can use ',' to separate more than one quantization algorithms, such as `--autoq_format fp8,int4_awq,w4a8_awq`. Please attention that using int8_sq and fp8 together is not supported.
+- autoq_format: Specific quantization algorithms are searched in auto quantization. The algorithm must be in ['fp8', 'int4_awq', 'w4a8_awq', 'int8_sq'] and you can use ',' to separate more than one quantization algorithm, such as `--autoq_format fp8,int4_awq,w4a8_awq`. Please note that using int8_sq and fp8 together is not supported.
 - auto_quantize_bits: Effective bits constraint for auto quantization. If not set, regular quantization without auto quantization search is applied. Note: it must be set within correct range otherwise it will be set by lowest value if possible. For example, the weights of LLMs have 16 bits defaultly and it results in a weight compression rate of 40% if we set `auto_quantize_bits` to 9.6 (9.6 / 16 = 0.6), which means the average bits of the weights are 9.6 but not 16. However, which format to choose is determined by solving an optimization problem, so you need to generate the according checkpoint manually if you want to customize your checkpoint formats. The format of mixed precision checkpoint is described in detail below.
 - output_dir: Path to save the quantized checkpoint.
 - dtype: Specify data type of model when loading from Hugging Face.

From 4cd5bccabaef1575de1b997ab1212d8b793e383d Mon Sep 17 00:00:00 2001
From: jaycoolslm <86686746+jaycoolslm@users.noreply.github.com>
Date: Sat, 6 Dec 2025 18:21:08 +0000
Subject: [PATCH 2/2] Fix nemo-toolkit dependency specification in requirements

Signed-off-by: jaycoolslm <86686746+jaycoolslm@users.noreply.github.com>
---
 examples/quantization/requirements.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/examples/quantization/requirements.txt b/examples/quantization/requirements.txt
index f14563d3a47..9c65f61fb7b 100644
--- a/examples/quantization/requirements.txt
+++ b/examples/quantization/requirements.txt
@@ -1,7 +1,7 @@
 -c ../constraints.txt
 tensorrt_llm>=0.0.0.dev0
 datasets==3.1.0
-nemo-toolkit[all]==2.0.0rc1
+nemo-toolkit==2.0.0rc1
 rouge_score
 transformers_stream_generator==0.0.4
 tiktoken
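
For readers of the README hunk above, a minimal sketch of how the documented flags might be combined on the command line. The `quantize.py` entry point and the model/output paths are assumptions for illustration; only the flag names and the format list come from the README text itself:

```bash
# Hedged sketch: script name and paths are assumptions, not taken from
# the patch. Flags (--autoq_format, --auto_quantize_bits, --output_dir,
# --dtype) are the options documented in the README hunk above.
# int8_sq is omitted from the format list because the README notes it
# cannot be combined with fp8.
python quantize.py \
    --model_dir ./path/to/hf_model \
    --dtype float16 \
    --autoq_format fp8,int4_awq,w4a8_awq \
    --auto_quantize_bits 9.6 \
    --output_dir ./quantized_checkpoint
```

Per the README, `--auto_quantize_bits 9.6` constrains the search to an average of 9.6 bits per weight rather than 16, i.e. the 9.6 / 16 = 0.6 ratio (roughly 40% weight compression) described in the auto_quantize_bits note.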