CPU Memory Consumption Reduction #3839
Replies: 3 comments 1 reply
-
In Qwen, the TRT builder uses 1x of the model size to build the live engine.
-
INetworkDefinition does not actually take any memory; it is the lowered graph, and constant folding takes up to 1x (0-1x) memory. Code here: INetwork just holds references to the weights in the lowered graph.
-
Summary

Add four opt-in memory modes to make Torch-TensorRT predictable and usable on constrained machines. Modes are selectable via the Python API and environment variables.

Modes

1) standard (default)
- Behavior: No extra memory optimization.
- Lifecycle: Models/engines stay resident on GPU for compile + run.
- Use when: You have ample CPU and GPU memory and want a balance between CPU and GPU memory consumption.
- Consumption: CPU memory uses ~4x of the model size; GPU uses 2x of the model size.

2) low_CPU_ram
- Goal: Cut host memory consumption during compile/run.
- Techniques: After engine building, malloc_trim is used to release memory. Before serialization, CPU memory consumption is reduced to a minimum (less than 1x), and serialization takes up to 2x, so peak memory usage is generally below 3x.
- Use when: You have enough GPU memory (>2x model size) and limited CPU memory (<3x model size).
- Risk: The stability of malloc_trim is experimental.

3) low_GPU_vRAM
- Goal: Lower peak GPU VRAM.
- Techniques:
- Use when: You don't have enough GPU memory (<2x model size) but have enough CPU memory (>5x model size).

4) all_on
- Goal: Run under tight CPU & GPU budgets.
- Techniques: low_CPU_ram + low_GPU_vRAM combined.
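Since the proposal says the modes are selectable via both the Python API and environment variables, here is a minimal sketch of how that selection could be resolved. The names `resolve_memory_mode` and `TORCHTRT_MEMORY_MODE` are hypothetical, since the proposal does not fix the exact parameter or variable names:

```python
import os

# Hypothetical names for illustration only: the proposal does not specify the
# actual API parameter or environment-variable names.
MEMORY_MODES = {"standard", "low_CPU_ram", "low_GPU_vRAM", "all_on"}

def resolve_memory_mode(api_value=None, env_var="TORCHTRT_MEMORY_MODE"):
    """An explicit API argument wins over the env var; default is 'standard'."""
    mode = api_value or os.environ.get(env_var, "standard")
    if mode not in MEMORY_MODES:
        raise ValueError(f"unknown memory mode: {mode!r}")
    return mode
```

Having the API argument take precedence over the environment variable keeps per-call behavior explicit while still allowing fleet-wide defaults to be set without code changes.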
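The low_CPU_ram mode relies on glibc's malloc_trim to return freed heap pages to the OS. A hedged sketch of that technique from Python, via ctypes (the helper name `trim_host_heap` is mine; malloc_trim itself is a real glibc extension, unavailable on non-glibc platforms):

```python
import ctypes
import ctypes.util

def trim_host_heap():
    """Ask the allocator to release unused heap pages back to the OS.

    malloc_trim is a glibc extension; on platforms without it this helper
    is a no-op that returns 0.
    """
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return 0
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return 0
    # malloc_trim(0) returns 1 if memory was released to the system, else 0.
    return libc.malloc_trim(0)
```

This mirrors the proposal's risk note: whether the call actually lowers resident memory depends on allocator internals and fragmentation, which is why the mode is flagged as experimental.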