[question] llm api #10044

@geraldstanje

Description

System Info

Hi,
I have a large language model. How can I quantize it for use with the LLM API with attn_backend = "flashinfer"? Do I need to quantize the model ahead of time, and if so, which quantization formats does the LLM API support?
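For context, I am calling the LLM API roughly as in the minimal sketch below; the model name and quantization algorithm are placeholders, not my exact setup:

```python
# Minimal sketch of the call that triggers the error below.
# The model name and QuantAlgo choice are placeholders.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import QuantAlgo, QuantConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",            # placeholder model
    attn_backend="flashinfer",                           # FlashInfer attention backend
    quant_config=QuantConfig(quant_algo=QuantAlgo.FP8),  # rejected by the PyTorch backend
)
```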

I get the error: `cannot be used with PyTorch backend: ['quant_config']`
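If quantizing ahead of time is the intended path, would something like the sketch below be correct? I am assuming here that the PyTorch backend can load a checkpoint that was already quantized (e.g. with NVIDIA ModelOpt, or a pre-quantized FP8 checkpoint from Hugging Face); the model name is just an example:

```python
# Sketch of loading a pre-quantized checkpoint instead of passing quant_config.
# Assumes a pre-quantized FP8 Hugging Face checkpoint works with the PyTorch backend.
from tensorrt_llm import LLM

llm = LLM(
    model="nvidia/Llama-3.1-8B-Instruct-FP8",  # example pre-quantized checkpoint
    attn_backend="flashinfer",
)

for output in llm.generate(["Hello, my name is"]):
    print(output.outputs[0].text)
```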

Thanks

cc @karljang @LinPoly @laikhtewari

How would you like to use TensorRT-LLM

I want to run inference of a [specific model](put Hugging Face link here). I don't know how to integrate it with TensorRT-LLM or optimize it for my use case.

Specific questions:

  • Model:
  • Use case (e.g., chatbot, batch inference, real-time serving):
  • Expected throughput/latency requirements:
  • Multi-GPU setup needed:

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels: LLM API<NV>, question, stale, waiting for feedback
