According to NVIDIA's official blog post "Inside the NVIDIA Rubin Platform: Six New Chips, One AI Supercomputer", the 3rd Gen Transformer Engine features "hardware-accelerated adaptive compression designed to boost NVFP4 performance while preserving accuracy", which enables up to 50 PFLOPS of NVFP4 compute for inference and 35 PFLOPS for training.
- What is the technical mechanism behind the "hardware-accelerated adaptive compression" in the 3rd Gen Transformer Engine?
- What are the key factors behind the NVFP4 performance gap between training (35 PFLOPS) and inference (50 PFLOPS)?
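
To put the gap in numbers, here is a quick calculation that uses only the two figures quoted from the blog. The constant and function names are my own, chosen for illustration, and the result says nothing about what causes the difference.

```python
# Peak NVFP4 throughput figures as quoted from the NVIDIA blog, in PFLOPS.
INFERENCE_PFLOPS = 50.0
TRAINING_PFLOPS = 35.0


def throughput_gap(inference, training):
    """Return (inference / training, training / inference) for two peak figures."""
    return inference / training, training / inference


ratio, fraction = throughput_gap(INFERENCE_PFLOPS, TRAINING_PFLOPS)
print(f"inference / training = {ratio:.2f}x")     # about 1.43x
print(f"training / inference = {fraction:.0%}")   # 70%
```

In other words, the quoted training figure is 70% of the quoted inference figure, and that 30% difference is what the second question asks about.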