Open
Labels
LLM API · question · stale · waiting for feedback
Description
System Info
Hi,
I have a large language model. How can I quantize it for use with the LLM API with attn_backend = "flashinfer"? Do I need to quantize the model ahead of time, and if so, which checkpoint formats does the LLM API support? (A sketch of the ahead-of-time flow I was considering is below.)
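For reference, this is the ahead-of-time path I pieced together from the NVIDIA ModelOpt docs. The config preset, export helper, and model id are my guesses/placeholders, so please correct me if this is not the intended flow:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder for my model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration pass over a handful of sample prompts (a real run
    # would use a proper calibration dataset).
    for prompt in ["Hello, world.", "The quick brown fox jumps over the lazy dog."]:
        inputs = tokenizer(prompt, return_tensors="pt")
        m(**inputs)

# Quantize to FP8 using ModelOpt's default FP8 preset.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a Hugging Face-style quantized checkpoint, which I assume the
# LLM API / PyTorch backend can then load directly.
export_hf_checkpoint(model, export_dir="./my-model-fp8")
```

Is this the intended flow, and is an FP8 checkpoint in this format supported by the PyTorch backend with flashinfer?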
When I instead pass a quant_config to the LLM API directly, I get the error: "cannot be used with PyTorch backend: ['quant_config']". A minimal repro of what I'm attempting follows.
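(Import paths follow the LLM API quantization example as I understand it; the model id is a placeholder for my local checkpoint.)

```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder for my model
    attn_backend="flashinfer",
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),
)
# -> fails with: cannot be used with PyTorch backend: ['quant_config']
```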
Thanks
cc @karljang @LinPoly @laikhtewari
How would you like to use TensorRT-LLM
I want to run inference of a [specific model](put Hugging Face link here). I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.
Specific questions:
- Model:
- Use case (e.g., chatbot, batch inference, real-time serving):
- Expected throughput/latency requirements:
- Multi-GPU setup needed:
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.