Description
trtllm-build tries to allocate the entire GPU's memory without considering that some of it may already be in use or reserved, and there is no way to pass a VRAM limit to the build command.
I tried hacking builder.py to read an environment variable and cap the maximum workspace when the builder config is created, but no dice. I then tried hooking both cuMemGetInfo_v2 and cudaMemGetInfo via LD_PRELOAD, but it seems those functions are never called.
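For reference, this is roughly the kind of patch I was attempting inside builder.py (a minimal sketch, not the actual code; TRTLLM_MAX_WORKSPACE_BYTES is a made-up variable name, and the real hook point in trtllm-build may differ):

import os
import tensorrt as trt

def apply_workspace_cap(config: trt.IBuilderConfig) -> None:
    # Hypothetical env var; trtllm-build does not actually read this.
    cap = os.environ.get("TRTLLM_MAX_WORKSPACE_BYTES")
    if cap is not None:
        # Clamp the TensorRT workspace memory pool to the requested size.
        config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, int(cap))

Even with the workspace pool capped this way, the allocation in virtualMemoryBuffer.cpp below does not appear to respect it.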
Environment
NVIDIA-supplied TRT-LLM 1.2.0rc4 Docker image
TensorRT Version: 10.x
NVIDIA GPU: 5070
NVIDIA Driver Version: 581.57
CUDA Version: 13.0.97
CUDNN Version: latest
Operating System: Windows 10 + WSL (or Docker, have tried both)
Python Version (if applicable): 3.12
Tensorflow Version (if applicable):
PyTorch Version (if applicable): 2.9.0
Baremetal or Container (if so, version): Bare and Container
Relevant Files
Model link: https://huggingface.co/SanjiWatsuki/Kunoichi-DPO-v2-7B/tree/main
Steps To Reproduce
Run trtllm-build under Windows + WSL (where Windows, WDDM, and the desktop compositor reserve roughly 0.5-1.5 GB of VRAM for OS display alone) with a decent-sized model, in my case Kunoichi 7B.
Commands or scripts:
trtllm-build \
  --checkpoint_dir /mnt/c/ai/models/ckpt_kuno \
  --output_dir /mnt/c/ai/models/kuno_eng_fp16 \
  --max_batch_size 1 \
  --max_input_len 1024 \
  --max_seq_len 1024 \
  --max_num_tokens 1024 \
  --kv_cache_type paged \
  --paged_kv_cache enable \
  --gpt_attention_plugin float16 \
  --gemm_plugin float16 \
  --monitor_memory \
  --log_level info
Have you tried the latest release?: Yes
Can this model run on other frameworks? For example run ONNX model with ONNXRuntime (polygraphy run <model.onnx> --onnxrt): Unknown
The short and skinny is this: virtualMemoryBuffer.cpp tries to allocate essentially the entire GPU's memory without checking how much is already in use, and offers no way to set a hard cap. I even tried hooking both cuMemGetInfo_v2 and cudaMemGetInfo with LD_PRELOAD, sadly with no luck; those two functions apparently are never called by the memory manager, which seems to use some internal mechanism to decide how much to allocate. (A quick way to see how much VRAM is already reserved before the build starts is sketched after the traceback below.)
[12/06/2025-22:12:21] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::154] Error Code 2: OutOfMemory (Requested size was 11665408000 bytes.)
[12/06/2025-22:12:21] [TRT] [E] [virtualMemoryBuffer.cpp::resizePhysical::141] Error Code 1: Cuda Driver (In resizePhysical at optimizer/builder/virtualMemoryBuffer.cpp:141)
[12/06/2025-22:12:21] [TRT] [W] Requested amount of GPU memory (11665408000 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[12/06/2025-22:12:21] [TRT] [E] [globWriter.cpp::makeResizableGpuMemory::514] Error Code 2: OutOfMemory (Requested size was 11665408000 bytes.)
Traceback (most recent call last):
File "/home/mash/ai/trtllm_env312/bin/trtllm-build", line 7, in <module>
sys.exit(main())
^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/commands/build.py", line 542, in main
parallel_build(model_config, ckpt_dir, build_config, args.output_dir,
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/commands/build.py", line 381, in parallel_build
passed = build_and_save(rank, rank % workers, ckpt_dir,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/commands/build.py", line 356, in build_and_save
engine = build_model(build_config,
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/commands/build.py", line 349, in build_model
return build(model, build_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/builder.py", line 1288, in build
engine = None if build_config.dry_run else builder.build_engine(
^^^^^^^^^^^^^^^^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/_common.py", line 210, in decorated
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/mash/ai/trtllm_env312/lib/python3.12/site-packages/tensorrt_llm/builder.py", line 425, in build_engine
assert engine is not None, 'Engine building failed, please check the error log.'
^^^^^^^^^^^^^^^^^^
AssertionError: Engine building failed, please check the error log.
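To confirm how much memory is actually available before the build, here is a minimal diagnostic sketch (assuming PyTorch, which is already in the TRT-LLM environment). It prints free vs. total device memory; on this setup the gap is the VRAM that WDDM and the compositor hold, which the requested allocation above does not account for:

import torch

# Wraps cudaMemGetInfo: returns (free, total) in bytes for device 0.
free, total = torch.cuda.mem_get_info(0)
print(f"total : {total / 2**30:.2f} GiB")
print(f"free  : {free / 2**30:.2f} GiB")
print(f"held by OS/other processes: {(total - free) / 2**30:.2f} GiB")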