Hi, thanks for releasing the code and paper!
Could you share a ballpark training duration so we can budget the cost of reproducing your results?
From the paper I see you used 4 × A100 (80 GB) GPUs, and Graph-R1 was trained with GRPO for 3 epochs with batch size 128 and max length 4096 (per Appendix G). Step 4 in the README of this repo also says:
“Run GRPO/REINFORCE++/PPO training with Qwen2.5-3B-Instruct (Need 4 × 48 GB GPUs)”
Roughly how long should we expect training to take on 4 × 48 GB GPUs (e.g., A40s on Runpod) to reach your reported results on, say, HotpotQA? Is this closer to a few hours, tens of GPU-hours, or hundreds of GPU-hours?
If possible, any of the following would help a ton:
- Wall-clock time per epoch (and total) on your 4 × A100 80 GB setup for Qwen2.5-3B and 7B
- Throughput you observed (tokens/sec or samples/sec), plus grad-accum and precision settings
- Approx # of training steps actually run (did you early-stop before 3 epochs or go beyond?)
- Any dataset-specific runtime notes for HotpotQA or any other dataset (e.g., typical effective sequence lengths)
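In case it helps frame the ask, here's the rough back-of-envelope I'd plug those numbers into. Everything below marked as a placeholder (dataset size, throughput) is a guess on my end, not something taken from the paper:

```python
# Rough GPU-hour estimate for budgeting. Every value marked "placeholder" is a
# guess I'd replace with the real numbers requested above; nothing here is from the paper.

num_epochs = 3            # per Appendix G
batch_size = 128          # per Appendix G
train_examples = 90_000   # placeholder: size of the RL training split
num_gpus = 4              # 4 x 48 GB GPUs (the setup I'm pricing)

steps = num_epochs * train_examples // batch_size

samples_per_sec = 1.0     # placeholder: the observed throughput I'm asking about
wall_clock_hours = (steps * batch_size) / samples_per_sec / 3600
gpu_hours = wall_clock_hours * num_gpus

print(f"~{steps} steps, ~{wall_clock_hours:.1f} h wall-clock, ~{gpu_hours:.0f} GPU-hours")
```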
Totally fine to answer at an order-of-magnitude level; I'm just trying to size the experiment budget. Thanks!