Hello,
I ran verl-agent/examples/grpo_trainer/run_webshop.sh on WebShop to test the performance of the GRPO baseline and found that the final results are much higher than those reported in the paper, even higher than SPEAR. Could you please explain why this happens?
