Labels
Module:Quantization (Issues related to Quantization), Module:Runtime (Other generic runtime issues that do not fall into other modules)
Description
Title
Jetson Thor (SM 110) advertises FP8/FP4 throughput but TensorRT silently falls back to FP32 when FP8/FP4 flags are enabled
Summary
- Expectation: Jetson Thor datasheet claims 1,035 TFLOPS (Dense FP4 | Sparse FP8 | Sparse INT8) and 517 TFLOPS (Dense FP8 | Sparse FP16), so the platform should accelerate FP8/FP4 in TensorRT.
- Reality: On Jetson Thor (compute capability 11.0), TensorRT 10.13.3.9 accepts BuilderFlag.FP8 / BuilderFlag.FP4 but silently builds FP32 engines (larger files, DataType.FLOAT outputs, FP32-scale weights). No error or warning indicates the fallback.
- Impact: Users targeting the advertised low-precision formats waste time debugging. Real throughput stays at FP32, contrary to the product specs.
Environment
- Device: NVIDIA Jetson Thor developer kit (GPU compute capability 11.0 / SM 110)
- OS: Jetson Linux (default Thor image)
- CUDA: 13.0
- TensorRT: 10.13.3.9 (Python API via /usr/bin/python3)
- Python: 3.12
- Model: examples/gpt2.onnx from https://github.com/commaai/bodyjim/blob/master/examples/roam.py (comma.ai bodyjim)
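The TensorRT and Python versions listed above can be confirmed directly on the device; a minimal check (nothing beyond the standard library and the tensorrt package is assumed):

```python
import sys
import tensorrt as trt

# Print the interpreter and TensorRT versions to confirm the environment above.
print("Python:", sys.version.split()[0])
print("TensorRT:", trt.__version__)
```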
Reproduction Steps
- Prepare Jetson Thor with TensorRT 10.13.3.9 and CUDA 13.0.
- Parse an ONNX model and request FP8:
```python
import tensorrt as trt
from pathlib import Path

logger = trt.Logger(trt.Logger.VERBOSE)
builder = trt.Builder(logger)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

onnx_path = Path("examples/gpt2.onnx")
with open(onnx_path, "rb") as f:
    parser.parse(f.read())

config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)
config.set_flag(trt.BuilderFlag.FP8)  # Same behavior with FP4

serialized_engine = builder.build_serialized_network(network, config)
Path("gpt2_fp8.plan").write_bytes(serialized_engine)
```
- Inspect the plan:
```python
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(Path("gpt2_fp8.plan").read_bytes())
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)
    dtype = engine.get_tensor_dtype(name)
    print(name, dtype)
```
- Compare plan sizes:
```
# FP16 reference (TensorRT flag FP16): gpt2_fp16.plan ≈ 199 MB
# FP8 request:                         gpt2_fp8.plan  ≈ 382 MB (roughly double FP16, matching FP32)
```
- Observe the TensorRT console logs:
```
[TRT] [I] Total Weights Memory: 399224448 bytes
...
Outputs: logits → DataType.FLOAT
```
- Same behavior with the FP4 flag: plan size ≈ 382 MB, outputs still DataType.FLOAT.
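As a follow-up to the "Inspect the plan" step, per-layer precision can also be dumped with the engine inspector; a minimal sketch, assuming the plan is rebuilt with detailed profiling verbosity so per-layer details are recorded:

```python
import tensorrt as trt
from pathlib import Path

# Sketch: dump per-layer engine information to see each layer's actual precision.
# Assumes the builder config also set
#     config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED
# before building; otherwise the output may omit per-layer precision details.
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
engine = runtime.deserialize_cuda_engine(Path("gpt2_fp8.plan").read_bytes())
inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))
```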
Actual Behavior
- build_serialized_network returns a plan without errors.
- Generated engine sizes match FP32, not FP8/FP4.
- Runtime reports outputs as FP32 (DataType.FLOAT).
- No warning indicates the fallback. Users believe FP8 succeeded but receive FP32 performance.
Expected Behavior
- Builder should fail fast (or emit an explicit warning) if FP8/FP4 are unsupported on the target SM.
- Alternatively, TensorRT should honor the hardware claims and produce true FP8/FP4 engines on Jetson Thor.
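Until the builder fails fast as requested above, a user-side guard can at least catch the silent fallback; below is a sketch (the helper name is ours, and it only checks the symptom visible in this report, i.e. FP32 I/O dtypes on a plan built with the FP8/FP4 flags):

```python
import tensorrt as trt
from pathlib import Path

def check_no_fp32_fallback(plan_path: str, logger: trt.Logger) -> None:
    """Sketch of a post-build guard: raise if a plan that requested FP8/FP4
    still reports FP32 I/O tensors, the symptom described in this report."""
    runtime = trt.Runtime(logger)
    engine = runtime.deserialize_cuda_engine(Path(plan_path).read_bytes())
    fp32_io = [
        engine.get_tensor_name(i)
        for i in range(engine.num_io_tensors)
        if engine.get_tensor_dtype(engine.get_tensor_name(i)) == trt.DataType.FLOAT
    ]
    if fp32_io:
        raise RuntimeError(
            f"{plan_path}: FP8/FP4 was requested but these I/O tensors are FP32: {fp32_io}"
        )

# Usage with the plan from the reproduction above:
# check_no_fp32_fallback("gpt2_fp8.plan", trt.Logger(trt.Logger.WARNING))
```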
Additional Context
- Jetson Thor architecture sheet advertises:
- 1,035 TFLOPS (Dense FP4 | Sparse FP8 | Sparse INT8)
- 517 TFLOPS (Dense FP8 | Sparse FP16)
- 2070 TFLOPS (Sparse FP4)
- Jetson Thor GPU is compute capability 11.0, so FP8/FP4 should be available if the specs are accurate (a quick way to confirm the reported capability is sketched after this list).
- FP16 builds succeed with the expected size (~199 MB) and DataType.HALF outputs.
- Without the verification check we added, customers silently run FP32.
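A quick way to confirm the reported compute capability, assuming pycuda is available on the device (any CUDA binding that exposes device properties works equally well):

```python
# Sketch: confirm the GPU reports compute capability 11.0 (SM 110).
# Assumes the pycuda package is installed on the Jetson Thor device.
import pycuda.driver as cuda

cuda.init()
major, minor = cuda.Device(0).compute_capability()
print(f"GPU 0 compute capability: {major}.{minor}")
```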
Request
Please clarify:
- Is FP8/FP4 officially supported on Jetson Thor in TensorRT 10.13.3.9?
- If not, can TensorRT fail with an explicit message instead of silently building FP32?
- If yes, how can we produce true FP8/FP4 engines on SM 110?
We’re ready to share full scripts, logs, and plan files if needed.