Description
System Info
CPU architecture: arm64
GPU properties:
- GPU name: NVIDIA GB200
- GPU memory size: 192GB
- Clock frequencies used: N/A
Libraries:
- TensorRT-LLM branch or tag: 1.0.0rc4 (from Dockerfile)
- TensorRT-LLM commit: 69e9f6d
- Versions of TensorRT, ModelOpt, CUDA, cuBLAS, etc. used: bundled in the container image; the host uses CUDA 12.8
- NVIDIA driver version: 570.133.20
- OS: Container-Optimized OS from Google (Kernel version 6.6.93+)
Any other information that may be useful in reproducing the bug:
- The issue is being reproduced on a Kubernetes cluster.
Who can help?
@juney-nvidia / @ncomly-nvidia
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Set up the Environment:
Mount a persistent volume to serve as the Hugging Face cache (e.g., at /mnt/disk/models/huggingface_model_cache).
Set the following environment variables in the pod:
HF_HOME=<path_to_cache>
HF_HUB_CACHE=<path_to_cache>
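A quick way to confirm inside the pod that huggingface_hub picks up these variables (the path shown is just the example mount point from the step above):

# Print the env vars and the cache directory huggingface_hub will actually use.
import os
from huggingface_hub import constants

print(os.environ.get("HF_HOME"))       # e.g. /mnt/disk/models/huggingface_model_cache
print(os.environ.get("HF_HUB_CACHE"))  # same mount in this setup
print(constants.HF_HUB_CACHE)          # resolved cache dir, computed from the env at import time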
Pre-populate the Cache:
Ensure the desired model (e.g., Qwen/Qwen3-4B) is already downloaded and present in the cache directory specified by HF_HOME.
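For reference, one way to pre-populate the cache is with huggingface_hub directly (a sketch; any method that places the snapshot under HF_HOME works):

# Download the full model snapshot into the cache pointed to by HF_HOME / HF_HUB_CACHE.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="Qwen/Qwen3-4B")
print(local_path)  # .../hub/models--Qwen--Qwen3-4B/snapshots/<revision>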
Run the Benchmark (Online):
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95
Run the Benchmark (Offline):
Add the HF_HUB_OFFLINE=1 environment variable to the pod configuration.
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95
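As a point of comparison (a diagnostic, not one of the original steps), resolving the same snapshot with huggingface_hub alone should succeed from disk under HF_HUB_OFFLINE=1 if the cache is intact:

# Resolves entirely from the local cache; no network access is needed or attempted.
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="Qwen/Qwen3-4B", local_files_only=True)
print(path)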
Expected behavior
When HF_HOME and HF_HUB_CACHE are set, TensorRT-LLM should first check this local cache for the requested model. If the model is present, it should be loaded from the cache without any network activity.
When HF_HUB_OFFLINE=1 is set, TensorRT-LLM should exclusively use the local cache and not attempt any outbound network connections to the Hugging Face Hub.
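This cache-first lookup is exactly what huggingface_hub itself exposes; for illustration (the file name is the one from the traceback below):

# Returns a local path if the file is cached, a sentinel if the Hub is known
# not to have it, or None if it is simply not in the cache yet.
from huggingface_hub import try_to_load_from_cache

cached = try_to_load_from_cache("Qwen/Qwen3-4B", "model.safetensors")
print(cached)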
Actual behavior
Online setting:
The script ignores the local cache and proceeds to download the model from the Hugging Face Hub.
Offline setting:
The script fails with the following traceback, indicating it's still attempting a network connection despite the offline flag.
Traceback:
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 107, in send
    raise OfflineModeIsEnabled(
huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/Qwen/Qwen3-4B/resolve/main/model.safetensors: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.
Additional notes
TensorRT-LLM appears to disregard the HF_HOME and HF_HUB_CACHE environment variables, which are standard in the Hugging Face ecosystem for specifying model cache locations. Even when these variables are set to a valid path containing pre-downloaded models, trtllm-bench still attempts to download the models from the Hugging Face Hub.
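For comparison, a cache-aware resolution that honors HF_HOME, HF_HUB_CACHE, and HF_HUB_OFFLINE would look roughly like the sketch below (illustrative only; resolve_model is a hypothetical helper, not the current trtllm-bench code path):

# Hypothetical cache-first resolution: serve from the local cache when possible,
# and only contact the Hub when the snapshot is missing and offline mode is off.
from huggingface_hub import snapshot_download
from huggingface_hub.errors import LocalEntryNotFoundError

def resolve_model(repo_id: str) -> str:
    try:
        # Served entirely from HF_HOME / HF_HUB_CACHE when present.
        return snapshot_download(repo_id=repo_id, local_files_only=True)
    except LocalEntryNotFoundError:
        # Not cached: download from the Hub (raises cleanly if HF_HUB_OFFLINE=1).
        return snapshot_download(repo_id=repo_id)

print(resolve_model("Qwen/Qwen3-4B"))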