[Bug]: Investigate hang in torch-opt mode in all the dashboard models #10290

@MrGeva

Description

System Info

A hang is seen in all the dashboard models when using torch-opt mode. It has been isolated to the first rms_norm+allreduce fusion in all strategies: when the first fusion is skipped, the hang is gone. The hang does not happen in torch-cudagraph mode. See #9847 for more details.
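
For context, here is a minimal single-process sketch of the math that an rms_norm+allreduce fusion is usually expected to compute, assuming a sum all-reduce followed by RMSNorm. All function names are hypothetical and are not the project's APIs. The ranks are simulated with plain lists, so this does not reproduce the hang. It only shows the unfused reference result that skipping the fusion should still produce.

```python
import math

def all_reduce_sum(per_rank):
    """Simulate a sum all-reduce: every rank receives the elementwise sum."""
    total = [sum(vals) for vals in zip(*per_rank)]
    return [list(total) for _ in per_rank]

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]

def unfused_allreduce_rms_norm(per_rank, weight, eps=1e-6):
    """Reference path (assumed): all-reduce first, then normalize on each rank."""
    reduced = all_reduce_sum(per_rank)
    return [rms_norm(x, weight, eps) for x in reduced]

# Two simulated ranks, each holding a partial result of length 2.
per_rank = [[1.0, 2.0], [3.0, 2.0]]
outputs = unfused_allreduce_rms_norm(per_rank, weight=[1.0, 1.0])
```

Because the all-reduce is a collective, every rank has to reach it. A fused kernel on a real multi-GPU run has the same requirement, which is one reason a collective step is a natural place to look when a hang is isolated to a fusion.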

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

na

Expected behavior

na

actual behavior

na

additional notes

na

Metadata

Assignees

Labels

Frontend<NV> (Frontend of the LLM workflow), bug (Something isn't working)

Type

No type

Projects

Status

Backlog

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests
