docsrc/tutorials/compile_groot.rst
The primary entry point for model compilation and benchmarking is ``run_groot_torchtrt.py``. The ``fn_name`` argument allows users to target specific submodules of the GR00T N1.5 model for optimization, which is particularly useful for profiling and debugging individual components. For example, to benchmark the Vision Transformer module in FP16 precision mode, run:

.. code-block:: bash

    python run_groot_torchtrt.py \
        --precision FP16 \
        --use_fp32_acc \
VLA Optimizations
-----------------

The following optimizations have been applied to the components of the GR00T N1.5 model to improve performance using Torch-TensorRT:

* Vision Transformer (ViT)

  * The ViT component is optimized using the Torch-TensorRT MutableTorchTensorRTModule (MTTM). TensorRT optimizations include layer fusion, kernel auto-tuning, and dynamic shape handling. A minimal wrapping sketch is shown after this list.

  * FP8 quantization support is available to reduce model size and improve performance (a hedged quantization outline also follows the list).

  * For the SiglipVisionModel, the ``SiglipMultiheadAttentionPoolingHead`` of the ViT component is disabled to eliminate unnecessary latency overhead, as this layer is not utilized by the downstream model. See the `ViT implementation <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L258-L261>`_.

* Text Transformer (LLM)

  * MTTM support is available for the LLM component, and similar TensorRT optimizations apply.

  * FP8 quantization support is available to reduce model size and improve performance.

* Flow Matching Action Head

  * The Flow Matching Action Head consists of five components: the VLM backbone processor, State encoder, Action encoder, Action decoder, and DiT.

  * The VLM backbone processor uses a LayerNorm layer and a Self-Attention Transformer network to process the outputs of the VLM (ViT + LLM). We merge these two components into a single ``torch.nn.Module`` to minimize graph fragmentation and improve performance (a schematic merging sketch follows this list). See the `VLM backbone implementation <https://github.com/peri044/Isaac-GR00T/blob/6b34a65e02b07b19d689498ec75066792b4bb738/deployment_scripts/run_groot_torchtrt.py#L485-L506>`_.

  * The State encoder, Action encoder, and Action decoder use Multi-Layer Perceptron (MLP)-like networks to encode the state and action vectors. These are wrapped with MTTM, and standard TensorRT optimizations apply.

  * DiT is a Diffusion Transformer model that generates the action vector. It is wrapped with MTTM, and standard TensorRT optimizations apply. FP8 quantization support is available for this component.

* Module Merging

  * In some cases, similar to the VLM backbone processor, the eagle backbone (ViT + LLM) can be compiled jointly into a single ``torch.nn.Module`` to minimize graph fragmentation and improve performance. While this approach may yield better runtime performance, compiling these modules independently can be more CPU memory efficient during TensorRT compilation while still achieving comparable inference performance to the merged module.

* Dynamic Shape Management

  * A general optimization that provides performance improvements is using dynamic shapes only when necessary. In the GR00T N1.5 model, dynamic shapes are applied selectively to the LLM and DiT components, where input dimensions may vary (see the dynamic shape sketch after this list).

  * For components with predictable input sizes, fixed batch dimensions are preferred. Specifying a batch size as dynamic can reduce performance compared to a fixed batch size when the dimensions are known in advance, as TensorRT can apply more aggressive optimizations with static shapes.

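To make the MTTM wrapping above concrete, the following is a minimal, hedged sketch rather than the code used in ``run_groot_torchtrt.py``: it wraps a small stand-in module (not the actual GR00T ViT) with ``torch_tensorrt.MutableTorchTensorRTModule``, and the compile settings shown are illustrative assumptions.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch_tensorrt

    # Small stand-in for a ViT-style submodule; the deployment script wraps the
    # actual GR00T vision tower instead.
    vit = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).half().cuda().eval()

    # Illustrative compile settings (assumptions, not the script's exact options).
    settings = {
        "enabled_precisions": {torch.float16},
        "use_fp32_acc": True,        # keep matmul accumulation in FP32 for accuracy
        "immutable_weights": False,  # allow the engine to be refitted when weights change
    }

    # MTTM behaves like a regular nn.Module: the TensorRT engine is built on the
    # first forward call and refitted automatically if the wrapped weights change.
    vit_trt = torch_tensorrt.MutableTorchTensorRTModule(vit, **settings)

    x = torch.randn(2, 196, 1024, device="cuda", dtype=torch.half)
    out = vit_trt(x)
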
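The merging applied to the VLM backbone processor (and optionally to the eagle backbone) boils down to composing the stages into one ``torch.nn.Module`` before compilation, so that Torch-TensorRT traces a single graph. The sketch below uses placeholder dimensions and a generic ``nn.TransformerEncoderLayer``, not the GR00T classes.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch_tensorrt

    class MergedBackboneProcessor(nn.Module):
        """Schematic merge of a LayerNorm stage and a self-attention stage into
        a single module, so the pair is compiled as one TensorRT graph instead
        of two fragments."""

        def __init__(self, hidden_dim: int = 1536, num_heads: int = 8):
            super().__init__()
            self.norm = nn.LayerNorm(hidden_dim)
            self.attn = nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=num_heads, batch_first=True)

        def forward(self, vlm_features: torch.Tensor) -> torch.Tensor:
            return self.attn(self.norm(vlm_features))

    merged = MergedBackboneProcessor().cuda().eval()
    example = torch.randn(1, 512, 1536, device="cuda")

    # Compiling the merged module avoids the graph fragmentation that separate
    # per-stage compilation would incur.
    trt_merged = torch_tensorrt.compile(merged, ir="dynamo", inputs=[example])
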
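FP8 quantization of components such as the ViT, LLM, and DiT is typically performed with NVIDIA TensorRT Model Optimizer (``nvidia-modelopt``) before Torch-TensorRT compilation. The outline below is a hedged sketch on a placeholder network with random calibration data; the real workflow calibrates on representative model inputs.

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch_tensorrt
    import modelopt.torch.quantization as mtq
    from modelopt.torch.quantization.utils import export_torch_mode

    # Placeholder network and calibration batches standing in for a GR00T component.
    model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda().eval()
    calib_batches = [torch.randn(4, 1024, device="cuda") for _ in range(8)]

    def calibrate(m: nn.Module) -> None:
        # Run representative inputs so ModelOpt can collect the activation
        # ranges used to derive FP8 scaling factors.
        with torch.no_grad():
            for batch in calib_batches:
                m(batch)

    # Insert FP8 quantize/dequantize nodes and calibrate them in place.
    mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop=calibrate)

    # Export with the quantizers in a torch-exportable mode, then compile with
    # FP8 listed among the enabled precisions.
    with torch.no_grad(), export_torch_mode():
        exported = torch.export.export(model, (calib_batches[0],))
        trt_model = torch_tensorrt.dynamo.compile(
            exported,
            inputs=[calib_batches[0]],
            enabled_precisions={torch.float8_e4m3fn, torch.float16},
        )
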
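For dynamic shape management, the general pattern is to declare a min/opt/max range only for the dimensions that actually vary (the sequence length for the LLM and DiT here) and keep everything else static. A minimal sketch with a placeholder block and assumed dimensions:

.. code-block:: python

    import torch
    import torch.nn as nn
    import torch_tensorrt

    # Placeholder for an LLM/DiT-style block whose sequence length varies at runtime.
    block = nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True).cuda().eval()

    # Batch stays fixed at 1 (known in advance); only the sequence dimension gets
    # a min/opt/max range, so TensorRT builds one engine covering that range.
    dynamic_input = torch_tensorrt.Input(
        min_shape=(1, 16, 1024),
        opt_shape=(1, 256, 1024),
        max_shape=(1, 1024, 1024),
        dtype=torch.float32,
    )

    trt_block = torch_tensorrt.compile(block, ir="dynamo", inputs=[dynamic_input])

    # Components with fully predictable shapes are compiled from concrete example
    # tensors instead, letting TensorRT apply more aggressive static-shape optimizations.
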
While these optimizations have been specifically applied to the GR00T N1.5 model, many of them are generalizable to other Vision-Language-Action (VLA) models. Techniques such as selective dynamic shape management, component-level MTTM wrapping, and FP8 quantization can be adapted to similar architectures to achieve comparable performance improvements.
RoboCasa is a large-scale simulation framework for training generally capable robots. The GR00T N1.5 model can be evaluated in the RoboCasa simulation environment to better understand its behavior in closed-loop settings. This is especially useful for assessing quantitative performance on long-horizon or multi-step tasks.

Please follow these `instructions <https://github.com/robocasa/robocasa-gr1-tabletop-tasks?tab=readme-ov-file#getting-started>`_ to set up the RoboCasa simulation environment.

Once you set up the environment, you can run the following command to start the simulation from the ``Isaac-GR00T`` directory: