
Hanging on multiple GPU clusters #2

@YJHMITWEB

Description

Hi, thanks for your great work.

I am following the instructions to install and run the test scripts.

I tried two systems: one with 4x A100 40GB GPUs and the other with 4x A100 80GB GPUs.

I use the following command:

ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=4 tests/longspec_benchmark.py --target checkpoints/togethercomputer/LLaMA-2-7B-32K/model.pth --model checkpoints/TinyLlama/TinyLlama_v1.1/model.pth --model_name /home/users/jyao/ai_general_research/yueying/MagicDec/checkpoints/togethercomputer/LLaMA-2-7B-32K --rank_group 0 1 2 3 --draft_ranks 0 1 2 3 --gamma 3 --B 4 --prefix_len 16000 --gen_len 64 --streamingllm_budget 256 --benchmark

Both systems will hang at some point:
[screenshot: the benchmark output stops partway through]

For example, it stops here. nvidia-smi reports 100% GPU utilization, but I believe the GPUs are actually idle given the low power draw.

[screenshot: nvidia-smi output]
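To help isolate where it gets stuck, here is a minimal sketch I can run as a sanity check, assuming the hang is at the collective-communication layer rather than in the benchmark itself (the file name allreduce_check.py and the helper are just examples, not part of the repo):

```python
# allreduce_check.py -- minimal NCCL sanity check (my own sketch, not a MagicDec script).
# It only verifies that a small all_reduce completes across the 4 GPUs under the same
# torchrun launcher; it does not reproduce the benchmark itself.
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT for us.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # One small all_reduce per rank; if this also hangs, the problem is at the
    # communication layer rather than in the speculative-decoding code.
    x = torch.ones(1, device=f"cuda:{local_rank}") * (rank + 1)
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()

    expected = world_size * (world_size + 1) / 2
    print(f"rank {rank}: all_reduce result {x.item()} (expected {expected})", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

I can launch it the same way (ENABLE_INTRA_NODE_COMM=1 torchrun --standalone --nproc_per_node=4 allreduce_check.py). If it would help, I can also rerun the original command with NCCL_DEBUG=INFO and attach the NCCL logs.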
