fix(doc): clarify vllm usage with grpo (axolotl-ai-cloud#2573) [skip ci]

NanoCode012 · winglian · SalmanMohammadi · web-flow · commit f1df73a798c4 · 2025-04-28T10:07:45.000-04:00
* fix(doc): clarify vllm usage with grpo

* nit

Co-authored-by: salman <salman.mohammadi@outlook.com>

* Update docs/rlhf.qmd

---------

Co-authored-by: Wing Lian <wing@axolotl.ai>
Co-authored-by: salman <salman.mohammadi@outlook.com>
diff --git a/docs/rlhf.qmd b/docs/rlhf.qmd
@@ -502,9 +502,7 @@ The input format is a simple JSON input with customizable fields based on the ab
 Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
 :::
 
-If you have multiple GPUs available, we reccomend using `vLLM` with the `GRPOTrainer` to significantly speedup trajectory generation during training.
-First, launch a `vLLM` server using `trl vllm-serve` - you may use a config file or CLI overrides to configure your vLLM server. In this example, we're
-using 4 GPUs - 2 for training, and 2 for vLLM:
+In the latest GRPO implementation, `vLLM` is used to significantly speed up trajectory generation during training. In this example, we're using 4 GPUs - 2 for training, and 2 for vLLM:
 
 ::: {.callout-important}
 Make sure you've installed the correct version of vLLM by including it as an extra when installing axolotl, e.g. `pip install axolotl[vllm]`.
@@ -539,6 +537,10 @@ Your `vLLM` instance will now attempt to spin up, and it's time to kick off trai
 CUDA_VISIBLE_DEVICES=0,1 axolotl train grpo.yaml --num-processes 2
 ```
 
+::: {.callout-note}
+Due to TRL's implementation with vLLM, the vLLM instance must use the last N GPUs instead of the first N GPUs. This is why in the example above, we use `CUDA_VISIBLE_DEVICES=2,3` for the vLLM instance.
+:::
+
 #### Reward functions
 
 GRPO uses custom reward functions and transformations. Please have them ready locally.
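
Note on the GPU split in the callout added by this patch: the vLLM instance takes the last N GPUs and training takes the rest. A minimal sketch of that split follows, assuming GPUs are numbered contiguously from 0. The helper `split_gpus` is hypothetical, written only for illustration, and is not part of axolotl or TRL.

```python
def split_gpus(total_gpus: int, vllm_gpus: int) -> tuple[str, str]:
    """Return (training, vllm) values for CUDA_VISIBLE_DEVICES.

    Hypothetical helper: assumes device IDs 0..total_gpus-1 and gives
    the LAST `vllm_gpus` devices to vLLM, as the callout note requires.
    """
    if not 0 < vllm_gpus < total_gpus:
        raise ValueError("vllm_gpus must be between 1 and total_gpus - 1")
    ids = [str(i) for i in range(total_gpus)]
    cut = total_gpus - vllm_gpus
    training = ",".join(ids[:cut])  # first GPUs go to training
    vllm = ",".join(ids[cut:])      # last N GPUs go to vLLM
    return training, vllm


training, vllm = split_gpus(total_gpus=4, vllm_gpus=2)
print(f"CUDA_VISIBLE_DEVICES={training}  # training")
print(f"CUDA_VISIBLE_DEVICES={vllm}  # vLLM")
```

With 4 GPUs and 2 reserved for vLLM, this yields `0,1` for training and `2,3` for vLLM, matching the values used in the doc's example.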