[Bug][AutoDeploy]: Sharding fails on NemotronH hybrid models - layer detection groups multiple SSM blocks #10358

@tcherckez-nvidia

System Info

H100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

python3 /opt/tensorrt-llm/examples/auto_deploy/build_and_run_ad.py --model nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 --args.yaml-extra /opt/tensorrt-llm/examples/auto_deploy/model_registry/configs/dashboard_default.yaml --args.yaml-extra /opt/tensorrt-llm/examples/auto_deploy/model_registry/configs/world_size_2.yaml

Expected behavior

The model should build successfully.

actual behavior

0:   File "/opt/tensorrt-llm/tensorrt_llm/_torch/auto_deploy/utils/node_utils.py", line 490, in get_all_layer_subgraphs
0:     layer_subgraph = get_layer_after_linear_node(linear_nodes, terminating_indices)
0:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
0:   File "/opt/tensorrt-llm/tensorrt_llm/_torch/auto_deploy/utils/node_utils.py", line 871, in get_layer_after_linear_node
0:     assert len(ssm_nodes) == 1, "SSM layer must have exactly one SSM node"
0:            ^^^^^^^^^^^^^^^^^^^
0: AssertionError: SSM layer must have exactly one SSM node

additional notes

Running nvidia/NVIDIA-Nemotron-Nano-9B-v2-NVFP4 through AutoDeploy fails during the sharding transform with an assertion error: multiple SSM nodes are found in a single detected layer.

Root Cause Analysis
The get_layer_after_linear_node function uses a BFS traversal to detect layer boundaries based on linear projections with matching embedding dimensions. NemotronH hybrid models contain multiple consecutive Mamba blocks; when those blocks have compatible linear projection shapes, they are grouped into a single "layer". That layer then contains more than one SSM node, which trips the assertion shown in the traceback.
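
To make the failure mode concrete, here is a minimal toy sketch in Python. It is not the code in node_utils.py: the Node class, collect_layer, and count_ssm are hypothetical names invented for illustration, and the graph is modeled as a flat list rather than a real FX graph walked with BFS. It only shows how a missed boundary between two Mamba-style blocks yields a "layer" with two SSM nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a graph node; not the real FX node type."""
    name: str
    kind: str  # "linear", "ssm", or "other"


def collect_layer(nodes, start, terminating):
    """Walk forward from nodes[start] until a terminating index is hit (inclusive)."""
    layer = []
    for i in range(start, len(nodes)):
        layer.append(nodes[i])
        if i in terminating:
            break
    return layer


def count_ssm(layer):
    """Number of SSM nodes inside a detected layer."""
    return sum(1 for n in layer if n.kind == "ssm")


# Two consecutive Mamba-style blocks: in_proj -> ssm -> out_proj, twice.
graph = [
    Node("block0.in_proj", "linear"),
    Node("block0.ssm", "ssm"),
    Node("block0.out_proj", "linear"),
    Node("block1.in_proj", "linear"),
    Node("block1.ssm", "ssm"),
    Node("block1.out_proj", "linear"),
]

# Assumption: boundary detection misses block0.out_proj, so both blocks merge.
merged = collect_layer(graph, start=0, terminating={5})

# Boundary detection that closes at every out_proj keeps the blocks apart.
split = collect_layer(graph, start=0, terminating={2, 5})

print(count_ssm(merged))  # 2: would trip `assert len(ssm_nodes) == 1`
print(count_ssm(split))   # 1
```

In this toy, the fix is simply to recognize each block's output projection as a boundary; whether the real fix belongs in the boundary detection or in relaxing the assertion is left open here.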

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • AutoDeploy: <NV> AutoDeploy Backend
  • bug: Something isn't working

Projects

Status

In review
