Commit eaaafb6

chore: add updates
Signed-off-by: Dheeraj Peri <[email protected]>
1 parent 7c2d4a0 commit eaaafb6

File tree

docsrc/tutorials/compile_groot.rst

1 file changed: 25 additions & 17 deletions
@@ -56,6 +56,7 @@ The primary entry point for model compilation and benchmarking is ``run_groot_to
 The ``fn_name`` argument allows users to target specific submodules of the GR00T N1.5 model for optimization, which is particularly useful for profiling and debugging individual components. For example, to benchmark the Vision Transformer module in FP16 precision mode, run:
 
 .. code-block:: bash
+
     python run_groot_torchtrt.py \
         --precision FP16 \
         --use_fp32_acc \
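The ``fn_name`` targeting described above can be sketched in plain Python. This is a hypothetical simplification: the registry names (``vit``, ``llm``, ``dit``), the ``SUBMODULES`` table, and the ``bench`` helper are all illustrative, not the actual contents of ``run_groot_torchtrt.py``.

```python
# Hypothetical sketch of how a --fn_name flag could route a benchmark
# request to one submodule of a larger model. The registry below is
# illustrative, not the real run_groot_torchtrt.py dispatch table.
import argparse
import time

def bench(fn, *args, iters=3):
    """Time a callable over a few iterations; return mean seconds per call."""
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args)
    return (time.perf_counter() - start) / iters

# Stand-ins for compiled submodules (real code would invoke TensorRT engines).
SUBMODULES = {
    "vit": lambda x: [v * 2 for v in x],
    "llm": lambda x: [v + 1 for v in x],
    "dit": lambda x: [v - 1 for v in x],
}

def run(fn_name, data):
    fn = SUBMODULES[fn_name]       # fail fast on an unknown target name
    mean_s = bench(fn, data)
    return fn(data), mean_s

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--fn_name", choices=sorted(SUBMODULES), default="vit")
    args = parser.parse_args([])   # empty argv so the sketch runs standalone
    out, mean_s = run(args.fn_name, [1.0, 2.0])
    print(out)
```

The same pattern lets a profiling script time each component in isolation before timing the full pipeline.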
@@ -71,24 +72,27 @@ VLA Optimizations
 
 The following optimizations have been applied to the components of the GR00T N1.5 model to improve performance using Torch-TensorRT:
 
-1) Vision Transformer (ViT)
-- The ViT component is optimized by using the Torch-TensorRT MutableTorchTensorRTModule (MTTM). TensorRT optimizations include layer fusion, kernel auto-tuning, dynamic shape handling.
-- FP8 quantization support is available to reduce model size and improve performance.
-- For the SiglipVisionModel, the ``SiglipMultiheadAttentionPoolingHead`` of the ViT component is disabled to eliminate unnecessary latency overhead, as this layer is not utilized by the downstream model. See the implementation `here <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L258-L261>`_.
+* Vision Transformer (ViT)
+* The ViT component is optimized using the Torch-TensorRT MutableTorchTensorRTModule (MTTM). TensorRT optimizations include layer fusion, kernel auto-tuning, and dynamic shape handling.
+* FP8 quantization support is available to reduce model size and improve performance.
+* For the SiglipVisionModel, the ``SiglipMultiheadAttentionPoolingHead`` of the ViT component is disabled to eliminate unnecessary latency overhead, as this layer is not utilized by the downstream model. See the `ViT implementation <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L258-L261>`_.
 
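The MTTM wrapping described in the ViT bullets can be illustrated with a minimal pure-Python sketch of the underlying idea: compile lazily, reuse the cached engine, and recompile when the wrapped module changes. No TensorRT is involved here; the "engine" is mocked, and this is not Torch-TensorRT's actual implementation.

```python
# Minimal pure-Python sketch of a mutable compiled-module wrapper in the
# spirit of MTTM: compile on first call, reuse the cached engine, and
# invalidate it when the wrapped function is mutated. "Compile" is mocked.
class MutableCompiledModule:
    def __init__(self, fn):
        self.fn = fn
        self._engine = None
        self.compile_count = 0

    def _compile(self):
        # Real MTTM would trace the torch.nn.Module and build a TensorRT
        # engine; this sketch just wraps the callable and counts builds.
        self.compile_count += 1
        fn = self.fn
        self._engine = lambda x: fn(x)

    def update(self, fn):
        """Mutating the module invalidates the cached engine."""
        self.fn = fn
        self._engine = None

    def __call__(self, x):
        if self._engine is None:
            self._compile()
        return self._engine(x)

mod = MutableCompiledModule(lambda x: x * 2)
print(mod(3), mod.compile_count)   # compiled once, then reused
mod.update(lambda x: x + 10)       # next call triggers a recompile
print(mod(3), mod.compile_count)
```

The point of the pattern is that repeated inference calls pay the compilation cost only once, while weight or structure updates stay safe because they force a rebuild.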
-2) Text Transformer (LLM)
-- MTTM support for the LLM component and similar TensorRT optimizations apply.
-- FP8 quantization support is available to reduce model size and improve performance.
+* Text Transformer (LLM)
+* MTTM support is available for the LLM component, and similar TensorRT optimizations apply.
+* FP8 quantization support is available to reduce model size and improve performance.
 
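The FP8 quantization mentioned for the ViT, LLM, and DiT components relies on scaling tensors into the representable range of the FP8 format (about 448 for E4M3). The sketch below shows only that amax-based scaling idea in pure Python; real FP8 quantization (e.g. through TensorRT or NVIDIA ModelOpt) also rounds values onto the FP8 grid, which is omitted here.

```python
# Illustrative per-tensor FP8-style scaling in pure Python: scale values
# into the E4M3 representable range (max magnitude ~448), then dequantize.
# Real FP8 kernels would additionally cast to the 8-bit FP8 grid.
E4M3_MAX = 448.0

def quantize(values):
    amax = max(abs(v) for v in values) or 1.0   # avoid divide-by-zero
    scale = E4M3_MAX / amax
    q = [v * scale for v in values]             # would be cast to FP8 here
    return q, scale

def dequantize(q, scale):
    return [v / scale for v in q]

vals = [0.5, -2.0, 3.5]
q, scale = quantize(vals)
restored = dequantize(q, scale)
```

With the largest magnitude mapped to the format maximum, the 8-bit representation uses its limited dynamic range efficiently, which is why per-tensor (or per-channel) scales are central to FP8 inference.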
-3) Flow Matching Action Head
-- The Flow Matching Action Head component consists of 5 different components: VLM backbone processor, State encoder, Action encoder and Action decoder and DiT.
-- VLM backbone processor uses a LayerNorm layer and Self-Attention Transformer network to process the outputs of VLM (ViT + LLM). We merge these two components into a single ``torch.nn.Module`` to minimize graph fragmentation and improve performance. See the implementation `here <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L485-L506>`_.
-- State encoder, Action encoder and Action Decoder use a Multi-Layer Perceptron (MLP) like networks to encode the state and action vectors. These are wrapped with MTTM and standard TensorRT optimizations apply.
-- DiT is a Diffusion-Transformer model that is used to generate the action vector. It is wrapped with MTTM and standard TensorRT optimizations apply. FP8 quantization support is available to this component.
+* Flow Matching Action Head
+* The Flow Matching Action Head consists of five components: the VLM backbone processor, State encoder, Action encoder, Action decoder, and DiT.
+* The VLM backbone processor uses a LayerNorm layer and a self-attention Transformer network to process the outputs of the VLM (ViT + LLM). We merge these two components into a single ``torch.nn.Module`` to minimize graph fragmentation and improve performance. See the `VLM backbone implementation <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L485-L506>`_.
+* The State encoder, Action encoder, and Action decoder use Multi-Layer Perceptron (MLP)-like networks to encode the state and action vectors. These are wrapped with MTTM, and standard TensorRT optimizations apply.
+* DiT is a Diffusion Transformer model that generates the action vector. It is wrapped with MTTM, and standard TensorRT optimizations apply. FP8 quantization support is available for this component.
+
+* Module Merging
+* In some cases, similar to the VLM backbone processor, the eagle backbone (ViT + LLM) can be compiled jointly into a single ``torch.nn.Module`` to minimize graph fragmentation and improve performance. While this approach may yield better runtime performance, compiling these modules independently can be more CPU memory efficient during TensorRT compilation while still achieving comparable inference performance to the merged module.
 
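The module-merging idea (merging the LayerNorm and self-attention stages, or the whole eagle backbone, into one ``torch.nn.Module``) can be sketched without any deep-learning framework: plain callables stand in for modules, and the class and function names below are purely illustrative.

```python
# Sketch of module merging: wrapping two sequential components behind one
# entry point so a tracing compiler sees a single graph rather than two
# fragments. Plain callables stand in for torch.nn.Module instances.
class MergedBackbone:
    """Single forward pass spanning both stages."""
    def __init__(self, stage1, stage2):
        self.stage1 = stage1
        self.stage2 = stage2

    def __call__(self, x):
        # One call spanning both stages: a compiler tracing this sees one
        # graph and avoids a break at the stage boundary.
        return self.stage2(self.stage1(x))

# Illustrative stand-ins: a mean-centering "LayerNorm" and a scaling stage.
normalize = lambda xs: [v - sum(xs) / len(xs) for v in xs]
project = lambda xs: [2 * v for v in xs]

backbone = MergedBackbone(normalize, project)
print(backbone([1.0, 2.0, 3.0]))
```

The trade-off noted above follows from this structure: one merged graph is larger (more CPU memory during compilation) but removes the host-side round-trip between the two stages at inference time.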
-4) Dynamic Shape Management
-- A general optimization that provides performance improvements is using dynamic shapes only when necessary. In the GR00T N1.5 model, dynamic shapes are applied selectively to the LLM and DiT components where input dimensions may vary.
-- For components with predictable input sizes, fixed batch dimensions are preferred. Specifying a batch size as dynamic can reduce performance compared to a fixed batch size when the dimensions are known in advance, as TensorRT can apply more aggressive optimizations with static shapes.
+* Dynamic Shape Management
+* A general optimization is to use dynamic shapes only when necessary. In the GR00T N1.5 model, dynamic shapes are applied selectively to the LLM and DiT components, where input dimensions may vary.
+* For components with predictable input sizes, fixed batch dimensions are preferred. Specifying a batch size as dynamic can reduce performance compared to a fixed batch size when the dimensions are known in advance, since TensorRT can apply more aggressive optimizations with static shapes.
 
 While these optimizations have been specifically applied to the GR00T N1.5 model, many of them are generalizable to other Vision-Language-Action (VLA) models. Techniques such as selective dynamic shape management, component-level MTTM wrapping, and FP8 quantization can be adapted to similar architectures to achieve comparable performance improvements.
 
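The selective dynamic-shape policy described above can be sketched as a small helper that marks a dimension dynamic (with a min/opt/max range) only when it actually varies across observed inputs, and keeps it static otherwise. The dict format below is illustrative and not the exact ``torch_tensorrt.Input`` API.

```python
# Sketch of selective dynamic-shape management: derive a per-dimension
# spec from observed runtime shapes. Static dims stay as plain ints
# (letting a compiler optimize aggressively); varying dims get a
# min/opt/max range, mirroring how shape ranges are commonly declared.
def shape_spec(observed_shapes):
    spec = []
    for dim_values in zip(*observed_shapes):
        lo, hi = min(dim_values), max(dim_values)
        if lo == hi:
            spec.append(hi)   # static dimension: known in advance
        else:
            spec.append({"min": lo, "opt": dim_values[-1], "max": hi})
    return spec

# LLM-style inputs: batch fixed at 1, sequence length varies per request.
print(shape_spec([(1, 128), (1, 256), (1, 512)]))
```

Applied to GR00T N1.5, only the LLM and DiT inputs would receive range entries; the encoders' fixed batch dimensions stay static.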
@@ -99,15 +103,19 @@ RoboCasa is a large-scale simulation framework for training generally capable ro
 in RoboCasa simulation environment to better understand its behavior in closed-loop settings. This is especially useful for assessing quantitative performance on long-horizon or multi-step tasks.
 
 Please follow these `instructions <https://github.com/robocasa/robocasa-gr1-tabletop-tasks?tab=readme-ov-file#getting-started>`_ to set up the RoboCasa simulation environment.
-Once you setup the environment, you can run the following command to start the simulation from ``Isaac-GR00T`` directory:
+Once you set up the environment, you can run the following command from the ``Isaac-GR00T`` directory:
+
 .. code-block:: bash
+
     cd Isaac-GR00T
     python3 scripts/inference_service.py --server --model_path nvidia/GR00T-N1.5-3B --data_config fourier_gr1_arms_waist --use_torch_tensorrt --vit_dtype fp16 --llm_dtype fp16 --dit_dtype fp16 --precision fp16
 
 This would compile the GR00T N1.5 model using Torch-TensorRT and start the inference service at port 5555.
 
-You can then use the following command to start the simulation:
+You can then use the following command to start the simulation:
+
 .. code-block:: bash
+
     cd robocasa-gr1-tabletop-tasks
     python3 scripts/simulation_service.py --client --env_name gr1_unified/PnPCupToDrawerClose_GR1ArmsAndWaistFourierHands_Env --video_dir ./videos --max_episode_steps 720 --n_envs 1 --n_episodes 10 --use_torch_tensorrt
 
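When scripting sweeps over the server/client commands shown above (different precisions, dtypes, episode counts), assembling argument lists programmatically is less error-prone than string templating. The helper below is hypothetical and not part of the Isaac-GR00T repository; only the flag names mirror the tutorial's CLI.

```python
# Hypothetical helper for assembling the tutorial's CLI invocations from a
# flag dict, e.g. for benchmark sweeps. Boolean True becomes a bare switch
# (--server), other values become --name value pairs. Not part of the repo.
def build_command(script, flags):
    cmd = ["python3", script]
    for name, value in flags.items():
        if value is True:                 # boolean switches take no value
            cmd.append(f"--{name}")
        else:
            cmd.extend([f"--{name}", str(value)])
    return cmd

cmd = build_command(
    "scripts/inference_service.py",
    {"server": True, "model_path": "nvidia/GR00T-N1.5-3B", "precision": "fp16"},
)
print(" ".join(cmd))
```

Passing the resulting list to ``subprocess.run(cmd)`` avoids shell-quoting issues that arise when the command is built as a single string.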