Description
System Info
CPU architecture: arm64
GPU properties:
- GPU name: NVIDIA GB200
- GPU memory size: 192GB
- Clock frequencies used: N/A
Libraries:
- TensorRT-LLM branch or tag: 1.0.0rc4 (from Dockerfile)
- TensorRT-LLM commit: 69e9f6d
- Versions of TensorRT, ModelOpt, CUDA, cuBLAS, etc. used: bundled in the container image; the host uses CUDA 12.8
- NVIDIA driver version: 570.133.20
- OS: Container-Optimized OS from Google (Kernel version 6.6.93+)
Any other information that may be useful in reproducing the bug:
- The issue is being reproduced on a Kubernetes cluster.
Who can help?
@juney-nvidia / @ncomly-nvidia
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Set up the Environment:
Mount a persistent volume to serve as the Hugging Face cache (e.g., at /mnt/disk/models/huggingface_model_cache).
Set the following environment variables in the pod:
HF_HOME=<path_to_cache>
HF_HUB_CACHE=<path_to_cache>
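A quick way to confirm inside the pod that huggingface_hub picks up these variables (the path shown is just the example mount point from the step above):

# Print the env vars and the cache directory huggingface_hub will actually use.
import os
from huggingface_hub import constants

print(os.environ.get("HF_HOME"))       # e.g. /mnt/disk/models/huggingface_model_cache
print(os.environ.get("HF_HUB_CACHE"))  # same mount in this setup
print(constants.HF_HUB_CACHE)          # resolved cache dir, computed from the env at import time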
Pre-populate the Cache:
Ensure the desired model (e.g., Qwen/Qwen3-4B) is already downloaded and present in the cache directory specified by HF_HOME.
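For reference, one way to pre-populate the cache is with huggingface_hub directly (a sketch; any method that places the snapshot under HF_HOME works):

# Download the full model snapshot into the cache pointed to by HF_HOME / HF_HUB_CACHE.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="Qwen/Qwen3-4B")
print(local_path)  # .../hub/models--Qwen--Qwen3-4B/snapshots/<revision>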
Run the Benchmark (Online):
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95
Run the Benchmark (Offline):
Add the HF_HUB_OFFLINE=1 environment variable to the pod configuration.
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --kv_cache_free_gpu_mem_fraction 0.95
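As a point of comparison (a diagnostic, not one of the original steps), resolving the same snapshot with huggingface_hub alone should succeed from disk under HF_HUB_OFFLINE=1 if the cache is intact:

# Resolves entirely from the local cache; no network access is needed or attempted.
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="Qwen/Qwen3-4B", local_files_only=True)
print(path)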
Expected behavior
When HF_HOME and HF_HUB_CACHE are set, TensorRT-LLM should first check this local cache for the requested model. If the model is present, it should be loaded from the cache without any network activity.
When HF_HUB_OFFLINE=1 is set, TensorRT-LLM should exclusively use the local cache and not attempt any outbound network connections to the Hugging Face Hub.
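This cache-first lookup is exactly what huggingface_hub itself exposes; for illustration (the file name is the one from the traceback below):

# Returns a local path if the file is cached, a sentinel if the Hub is known
# not to have it, or None if it is simply not in the cache yet.
from huggingface_hub import try_to_load_from_cache

cached = try_to_load_from_cache("Qwen/Qwen3-4B", "model.safetensors")
print(cached)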
Actual behavior
Online setting:
The script ignores the local cache and proceeds to download the model from the Hugging Face Hub.
Offline setting:
The script fails with the following traceback, indicating it's still attempting a network connection despite the offline flag.
Traceback:
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-bench", line 8, in <module>
    sys.exit(main())
  ...
  File "/usr/local/lib/python3.12/dist-packages/huggingface_hub/utils/_http.py", line 107, in send
    raise OfflineModeIsEnabled(
huggingface_hub.errors.OfflineModeIsEnabled: Cannot reach https://huggingface.co/Qwen/Qwen3-4B/resolve/main/model.safetensors: offline mode is enabled. To disable it, please unset the `HF_HUB_OFFLINE` environment variable.
Additional notes
TensorRT-LLM appears to disregard the HF_HOME and HF_HUB_CACHE environment variables, which are standard in the Hugging Face ecosystem for specifying model cache locations. Even when these variables are set to a valid path containing pre-downloaded models, trtllm-bench still attempts to download the models from the Hugging Face Hub.
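For comparison, a cache-aware resolution that honors HF_HOME, HF_HUB_CACHE, and HF_HUB_OFFLINE would look roughly like the sketch below (illustrative only; resolve_model is a hypothetical helper, not the current trtllm-bench code path):

# Hypothetical cache-first resolution: serve from the local cache when possible,
# and only contact the Hub when the snapshot is missing and offline mode is off.
from huggingface_hub import snapshot_download
from huggingface_hub.errors import LocalEntryNotFoundError

def resolve_model(repo_id: str) -> str:
    try:
        # Served entirely from HF_HOME / HF_HUB_CACHE when present.
        return snapshot_download(repo_id=repo_id, local_files_only=True)
    except LocalEntryNotFoundError:
        # Not cached: download from the Hub (raises cleanly if HF_HUB_OFFLINE=1).
        return snapshot_download(repo_id=repo_id)

print(resolve_model("Qwen/Qwen3-4B"))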