Labels: Low Precision (lower-precision formats INT8/INT4/FP8 for TRT-LLM quantization, e.g. AWQ, GPTQ), question (further information is requested)
Description
System Info
- OS: Ubuntu 24.0
- Python version: 3.12
- CUDA version: 12.8
- GPU model(s): L40S
- Driver version: 580.95.05
- TensorRT-LLM version: 1.0.0
How would you like to use TensorRT-LLM
I want to run inference on https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-unquantized. I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.
Specific questions:
- Model: google/gemma-3-27b-it-qat-q4_0-unquantized
- Use case: chatbot
- Expected throughput/latency requirements: latency-sensitive
- Multi-GPU setup needed: No
Do we have support for the q4_0 format? I am trying to convert google/gemma-3-27b-it-qat-q4_0-unquantized (https://huggingface.co/google/gemma-3-27b-it-qat-q4_0-unquantized) to TensorRT-LLM. I first converted the model to a CausalLM checkpoint with Hugging Face Transformers, then ran:

```shell
python3 convert_checkpoint.py \
    --model-dir ${MODEL_PATH} \
    --output-model-dir ${TRT_CHECKPOINT_PATH} \
    --ckpt-type hf \
    --dtype bfloat16 \
    --use_weight_only \
    --weight_only_precision int4
```
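For context on what q4_0 means: it is the GGML/GGUF block format, where each group of 32 weights shares one fp16 scale and each weight is a 4-bit code. The sketch below (my own illustration from memory, not TensorRT-LLM or GGML code; the exact nibble ordering inside a block is an assumption) shows the storage layout and the resulting 4.5 bits per weight:

```python
import numpy as np

BLOCK = 32  # q4_0 quantizes weights in groups of 32

def q4_0_block_bytes():
    """Bytes used to store one q4_0 block: one fp16 scale + 32 packed 4-bit codes."""
    scale_bytes = 2          # one float16 scale per block
    packed = BLOCK // 2      # two 4-bit codes per byte -> 16 bytes
    return scale_bytes + packed

def q4_0_pack(codes):
    """Pack 32 integer codes in [0, 15] into 16 bytes (low nibble first;
    the real format's ordering may differ -- this is illustrative)."""
    codes = np.asarray(codes, dtype=np.uint8)
    return (codes[0::2] | (codes[1::2] << 4)).astype(np.uint8)

bits_per_weight = 8 * q4_0_block_bytes() / BLOCK  # 18 bytes / 32 weights = 4.5
```

So a q4_0 checkpoint is not a plain int4 tensor; loading it usefully requires dequantizing with the per-block scales first, which is presumably what the "unquantized" Hugging Face repo has already done.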
```shell
trtllm-build \
    --checkpoint_dir ${TRT_OUT_PATH} \
    --output_dir ${TRT_ENGINE} \
    --gemm_plugin auto \
    --gpt_attention_plugin auto \
    --remove_input_padding enable \
    --use_paged_context_fmha enable \
    --max_input_len 8192 \
    --max_seq_len 16384 \
    --max_num_tokens 32768 \
    --max_beam_width 1 \
    --max_batch_size 16
```
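A side note on how the limits above interact (my reading of the scheduler budget, stated as an assumption): max_num_tokens caps the total tokens processed per engine step, so with these settings at most four full 8192-token prompts can be prefilled in one step, even though max_batch_size allows 16 concurrent requests:

```python
max_input_len = 8192
max_num_tokens = 32768
max_batch_size = 16

# Full-length prompts that fit in one scheduling step under the token budget.
prompts_per_step = max_num_tokens // max_input_len

# The batch-size limit alone would allow more concurrent requests than
# the token budget can prefill at once.
assert prompts_per_step < max_batch_size
```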
This generates garbage output.
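One hypothesis for the garbage output (an assumption on my part, not a confirmed diagnosis): the "unquantized" repo ships weights that were already trained through q4_0 QAT, and --use_weight_only with int4 quantizes them a second time under a different grouping, which compounds the error. A rough numpy sketch with random weights standing in for the real ones, and with hypothetical group sizes:

```python
import numpy as np

def quant_dequant(w, group):
    """Symmetric int4 fake-quantization with one scale per `group` weights."""
    g = w.reshape(-1, group)
    scale = np.abs(g).max(axis=1, keepdims=True) / 8.0
    scale = np.where(scale == 0, 1.0, scale)          # avoid divide-by-zero
    q = np.clip(np.round(g / scale), -8, 7)           # signed 4-bit codes
    return (q * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 14)

once = quant_dequant(w, 32)        # q4_0-style QAT grouping (32 per scale)
twice = quant_dequant(once, 128)   # a second, differently-grouped int4 pass

mse_once = float(np.mean((w - once) ** 2))
mse_twice = float(np.mean((w - twice) ** 2))
```

If this is what is happening, converting with --dtype bfloat16 only (no --use_weight_only) and comparing a few generations against the HF model would confirm whether the second quantization pass is the culprit.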
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.