Skip to content

[BUG] run_qwen2_7b_dapo execution failed #39

@noob000007

Description

@noob000007
Image

When running bash recipe/retool/run_qwen2_7b_dapo.sh, I observed that many Ray Core tasks seemed to fail. At the beginning, everything looked normal, but later failures increased. However, training still appeared to continue — GPUs were running at full power, and WanDB was still logging. The training speed, though, was extremely slow. On 4×A800 GPUs, the estimated training time showed as 2 years.
在使用 bash recipe/retool/run_qwen2_7b_dapo.sh
观察 Ray 很多 Ray Core task 似乎都是Failed ?
我注意到 最开始的时候 都是正常的?随后后续训练失败的开始越来越多
但是观察训练似乎还在进行?GPU在满功耗运行 同时观察WanDB 也是
但是训练速度十分缓慢?
我在4*A800显卡上训练 ,显示需要的训练时间2年?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions