Closed
Labels
AutoDeploy, AutoDeploy Backend, Scale-out (Multi-GPU and distributed inference scaling issues, tensor/pipeline/data parallelism), bug (Something isn't working)
Description
System Info
CW-DFW. TensorRT-LLM main branch.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Token generation hangs when running the build_and_run_ad.py script with the torch-cudagraph compile backend and AllReduceStrategy.AUTO.
Repro steps:
1. In https://github.com/NVIDIA/TensorRT-LLM/blob/main/tensorrt_llm/_torch/auto_deploy/distributed/trtllm.py#L21, change the AllReduce strategy from NCCL to AUTO.
2. Run:
MODEL=meta-llama/Llama-3.1-8B-Instruct
python examples/auto_deploy/build_and_run_ad.py --model $MODEL --args.world-size 4 --args.compile_backend=torch-cudagraph
Expected behavior
The torch-cudagraph run should complete and produce legible outputs, as the torch-simple backend does under the same settings.
Actual behavior
Token generation hangs.
Additional notes
n/a
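Not part of the original report, but as a generic debugging aid: a silent hang like this can be turned into an explicit, loggable failure by running the blocking call under a deadline. This is a minimal sketch using only the Python standard library; `run_with_timeout` and the stand-in workloads are hypothetical helpers, not TensorRT-LLM APIs.

```python
import concurrent.futures
import time

def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; raise concurrent.futures.TimeoutError if it
    does not finish within timeout_s seconds. Note: the worker thread cannot be
    killed, so this only converts a silent hang into a visible error."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout_s)
    finally:
        # Don't block on the (possibly hung) worker when cleaning up.
        pool.shutdown(wait=False)

# Stand-in for a generation call that returns promptly:
print(run_with_timeout(lambda: "tokens", timeout_s=5.0))

# Stand-in for a hung call: times out after 0.1 s instead of blocking forever.
try:
    run_with_timeout(time.sleep, 0.1, 0.5)
except concurrent.futures.TimeoutError:
    print("generation hung past the deadline")
```

Wrapping the repro's generation step this way makes the AUTO-vs-NCCL comparison scriptable: the NCCL run finishes, the AUTO run raises instead of hanging the harness.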
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.