When running bash recipe/retool/run_qwen2_7b_dapo.sh, I observed that many Ray Core tasks seemed to fail. At the beginning, everything looked normal, but later failures increased. However, training still appeared to continue — GPUs were running at full power, and WanDB was still logging. The training speed, though, was extremely slow. On 4×A800 GPUs, the estimated training time showed as 2 years.
在使用 bash recipe/retool/run_qwen2_7b_dapo.sh
观察 Ray 很多 Ray Core task 似乎都是Failed ?
我注意到 最开始的时候 都是正常的?随后后续训练失败的开始越来越多
但是观察训练似乎还在进行?GPU在满功耗运行 同时观察WanDB 也是
但是训练速度十分缓慢?
我在4*A800显卡上训练 ,显示需要的训练时间2年?
When running
bash recipe/retool/run_qwen2_7b_dapo.sh, I observed that many Ray Core tasks seemed to fail. At the beginning, everything looked normal, but later failures increased. However, training still appeared to continue — GPUs were running at full power, and WanDB was still logging. The training speed, though, was extremely slow. On 4×A800 GPUs, the estimated training time showed as 2 years.在使用 bash recipe/retool/run_qwen2_7b_dapo.sh
观察 Ray 很多 Ray Core task 似乎都是Failed ?
我注意到 最开始的时候 都是正常的?随后后续训练失败的开始越来越多
但是观察训练似乎还在进行?GPU在满功耗运行 同时观察WanDB 也是
但是训练速度十分缓慢?
我在4*A800显卡上训练 ,显示需要的训练时间2年?