
TensorRT-LLM does not utilize local Huggingface Model Cache #6422

@Chris113113

Description


System Info

CPU architecture: arm64
GPU properties:

  • GPU name: NVIDIA GB200
  • GPU memory size: 192GB
  • Clock frequencies used: N/A

Libraries:

  • TensorRT-LLM branch or tag: 1.0.0rc4 (from Dockerfile)
  • TensorRT-LLM commit: 69e9f6d
  • Versions of TensorRT, Modelopt, CUDA, cuBLAS, etc. used: These are bundled in the container image. The host is using CUDA 12.8
  • NVIDIA driver version: 570.133.20
  • OS: Container-Optimized OS from Google (Kernel version 6.6.93+)

Any other information that may be useful in reproducing the bug:

  • The issue is being reproduced on a Kubernetes cluster.

Who can help?

@juney-nvidia / @ncomly-nvidia

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Set up the Environment:

Mount a persistent volume to serve as the Hugging Face cache (e.g., at /mnt/disk/models/huggingface_model_cache).

Set the following environment variables in the pod:

HF_HOME=<path_to_cache>
HF_HUB_CACHE=<path_to_cache>
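
To confirm the variables are actually picked up inside the pod, huggingface_hub can be asked which cache directory it resolved (a minimal check; the mount path is the example above):

# Minimal check: print the cache directory huggingface_hub resolves from
# HF_HOME / HF_HUB_CACHE. It should point at the mounted volume, not the
# default ~/.cache/huggingface/hub.
import os
from huggingface_hub import constants

print("HF_HOME      =", os.environ.get("HF_HOME"))
print("HF_HUB_CACHE =", os.environ.get("HF_HUB_CACHE"))
print("resolved     =", constants.HF_HUB_CACHE)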

Pre-populate the Cache:

Ensure the desired model (e.g., Qwen/Qwen3-4B) is already downloaded and present in the cache directory specified by HF_HOME.
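
One way to pre-populate the cache (a sketch using huggingface_hub; with HF_HOME/HF_HUB_CACHE already exported the explicit cache_dir can be omitted):

# Sketch: download Qwen/Qwen3-4B into the mounted cache while the pod still
# has network access. cache_dir here is the example mount point from above.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-4B",
    cache_dir="/mnt/disk/models/huggingface_model_cache",
)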

Run the Benchmark (Online):

trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95

Run the Benchmark (Offline):

Add the HF_HUB_OFFLINE=1 environment variable to the pod configuration.

trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95

Expected behavior

When HF_HOME and HF_HUB_CACHE are set, TensorRT-LLM should first check this local cache for the requested model. If the model is present, it should be loaded from the cache without any network activity.

When HF_HUB_OFFLINE=1 is set, TensorRT-LLM should exclusively use the local cache and not attempt any outbound network connections to the Hugging Face Hub.
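
For reference, the standard Hugging Face loaders behave this way; a minimal sketch (using transformers directly rather than trtllm-bench) that loads entirely from the cache when HF_HUB_OFFLINE=1 is set in the pod:

# Sketch: with the cache populated and HF_HUB_OFFLINE=1 set, these calls
# resolve Qwen/Qwen3-4B from the local cache and make no Hub requests.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
print(config.model_type)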

Actual behavior

Online setting:
The script ignores the local cache and proceeds to download the model from the Hugging Face Hub.

Offline setting:
The script fails with the following traceback, indicating it's still attempting a network connection despite the offline flag.

Traceback:

Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 107, in send
    raise OfflineModeIsEnabled(
huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/Qwen/Qwen3-4B/resolve/main/model.safetensors: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.

Additional notes

TensorRT-LLM appears to disregard the HF_HOME and HF_HUB_CACHE environment variables, which are standard in the Hugging Face ecosystem for specifying model cache locations. Even when these variables are set to a valid path containing pre-downloaded models, trtllm-bench still attempts to download the models from the Hugging Face Hub.
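
To rule out a cache-layout problem on my side, the cache contents can be listed with huggingface_hub (a small verification sketch):

# Verification sketch: list every repo huggingface_hub sees in the resolved
# cache; Qwen/Qwen3-4B should appear here before trtllm-bench is run.
from huggingface_hub import scan_cache_dir

for repo in scan_cache_dir().repos:
    print(repo.repo_id, repo.size_on_disk, repo.repo_path)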
