CPU Memory Consumption Reduction #3839
Replies: 3 comments 1 reply
-
In Qwen, the TRT builder uses 1x of the model size to build the live engine.
-
INetworkDefinition does not actually take any memory; it is the lowered graph, and constant folding takes up to 1x (0-1x) memory. Code here: INetwork just holds references to the weights in the lowered graph.
-
Summary

Add four opt-in memory modes to make Torch-TensorRT predictable and usable on constrained machines. Modes are selectable via the Python API and environment variables.

Modes

1) standard (default)
- Behavior: No extra memory optimization.
- Lifecycle: Models/engines stay resident on GPU for compile + run.
- Use when: You have ample CPU and GPU memory and want a balance between CPU and GPU memory consumption.
- Consumption: CPU memory uses ~4x of the model size; GPU uses 2x of the model size.

2) low_CPU_ram
- Goal: Cut host memory consumption during compile/run.
- Techniques: After engine building, malloc_trim is used to release memory. Before serialization, CPU memory consumption is reduced to a minimum (less than 1x), and serialization takes up to 2x, so peak memory usage is generally below 3x.
- Use when: You have enough GPU memory (>2x model size) and limited CPU memory (<3x model size).
- Risk: The stability of malloc_trim is experimental.

3) low_GPU_vRAM
- Goal: Lower peak GPU VRAM.
- Techniques:
- Use when: You don't have enough GPU memory (<2x model size) but have enough CPU memory (>5x model size).

4) all_on
- Goal: Run under tight CPU & GPU budgets.
- Techniques: low_CPU_ram + low_GPU_vRAM combined.
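Since the proposal says the modes are selectable via both the Python API and environment variables, here is a minimal sketch of how that selection could be resolved. The names `resolve_memory_mode` and `TORCHTRT_MEMORY_MODE` are hypothetical, since the proposal does not fix the exact parameter or variable names:

```python
import os

# Hypothetical names for illustration only: the proposal does not specify the
# actual API parameter or environment-variable names.
MEMORY_MODES = {"standard", "low_CPU_ram", "low_GPU_vRAM", "all_on"}

def resolve_memory_mode(api_value=None, env_var="TORCHTRT_MEMORY_MODE"):
    """An explicit API argument wins over the env var; default is 'standard'."""
    mode = api_value or os.environ.get(env_var, "standard")
    if mode not in MEMORY_MODES:
        raise ValueError(f"unknown memory mode: {mode!r}")
    return mode
```

Having the API argument take precedence over the environment variable keeps per-call behavior explicit while still allowing fleet-wide defaults to be set without code changes.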
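The low_CPU_ram mode relies on glibc's malloc_trim to return freed heap pages to the OS. A hedged sketch of that technique from Python, via ctypes (the helper name `trim_host_heap` is mine; malloc_trim itself is a real glibc extension, unavailable on non-glibc platforms):

```python
import ctypes
import ctypes.util

def trim_host_heap():
    """Ask the allocator to release unused heap pages back to the OS.

    malloc_trim is a glibc extension; on platforms without it this helper
    is a no-op that returns 0.
    """
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return 0
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return 0
    # malloc_trim(0) returns 1 if memory was released to the system, else 0.
    return libc.malloc_trim(0)
```

This mirrors the proposal's risk note: whether the call actually lowers resident memory depends on allocator internals and fragmentation, which is why the mode is flagged as experimental.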