
Hanging on multiple GPU clusters #2

@YJHMITWEB

Description

Hi, thanks for your great work.

I am following the instructions to install and run the test scripts.

I tried two systems: one with 4x A100 40GB GPUs and the other with 4x A100 80GB GPUs.

I use the following command:

ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=4 tests/longspec_benchmark.py --target checkpoints/togethercomputer/LLaMA-2-7B-32K/model.pth --model checkpoints/TinyLlama/TinyLlama_v1.1/model.pth --model_name /home/users/jyao/ai_general_research/yueying/MagicDec/checkpoints/togethercomputer/LLaMA-2-7B-32K --rank_group 0 1 2 3 --draft_ranks 0 1 2 3 --gamma 3 --B 4 --prefix_len 16000 --gen_len 64 --streamingllm_budget 256 --benchmark

Both systems will hang at some point:
[screenshot: the benchmark output stops partway through]

For example, it stops here. nvidia-smi reports 100% GPU utilization, but I believe the GPUs are actually idle given the low power draw.

[screenshot: nvidia-smi output]
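To help isolate where it gets stuck, here is a minimal sketch I can run as a sanity check, assuming the hang is at the collective-communication layer rather than in the benchmark itself (the file name allreduce_check.py and the helper are just examples, not part of the repo):

```python
# allreduce_check.py -- minimal NCCL sanity check (my own sketch, not a MagicDec script).
# It only verifies that a small all_reduce completes across the 4 GPUs under the same
# torchrun launcher; it does not reproduce the benchmark itself.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # One small all_reduce per rank; if this also hangs, the problem is at the
    # communication layer rather than in the speculative-decoding code.
    x = torch.ones(1, device=f"cuda:{local_rank}") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    expected = world_size * (world_size + 1) / 2
    print(f"rank {rank}: all_reduce result {x.item()} (expected {expected})", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

I can launch it the same way (ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=4 allreduce_check.py). If it would help, I can also rerun the original command with NCCL_DEBUG=INFO and attach the NCCL logs.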
