Commit f1df73a

NanoCode012, winglian, and SalmanMohammadi authored
fix(doc): clarify vllm usage with grpo (axolotl-ai-cloud#2573) [skip ci]
* fix(doc): clarify vllm usage with grpo
* nit

Co-authored-by: salman <[email protected]>

* Update docs/rlhf.qmd

---------

Co-authored-by: Wing Lian <[email protected]>
Co-authored-by: salman <[email protected]>
1 parent 8b33ae1 commit f1df73a

1 file changed: +5 -3 lines changed

docs/rlhf.qmd

Lines changed: 5 additions & 3 deletions
@@ -502,9 +502,7 @@ The input format is a simple JSON input with customizable fields based on the ab
 Check out our [GRPO cookbook](https://github.com/axolotl-ai-cloud/axolotl-cookbook/tree/main/grpo#training-an-r1-style-large-language-model-using-grpo).
 :::
 
-If you have multiple GPUs available, we reccomend using `vLLM` with the `GRPOTrainer` to significantly speedup trajectory generation during training.
-First, launch a `vLLM` server using `trl vllm-serve` - you may use a config file or CLI overrides to configure your vLLM server. In this example, we're
-using 4 GPUs - 2 for training, and 2 for vLLM:
+In the latest GRPO implementation, `vLLM` is used to significantly speedup trajectory generation during training. In this example, we're using 4 GPUs - 2 for training, and 2 for vLLM:
 
 ::: {.callout-important}
 Make sure you've installed the correct version of vLLM by including it as an extra when installing axolotl, e.g. `pip install axolotl[vllm]`.
@@ -539,6 +537,10 @@ Your `vLLM` instance will now attempt to spin up, and it's time to kick off trai
 CUDA_VISIBLE_DEVICES=0,1 axolotl train grpo.yaml --num-processes 2
 ```
 
+::: {.callout-note}
+Due to TRL's implementation with vLLM, the vLLM instance must use the last N GPUs instead of the first N GPUs. This is why in the example above, we use `CUDA_VISIBLE_DEVICES=2,3` for the vLLM instance.
+:::
+
 #### Reward functions
 
 GRPO uses custom reward functions and transformations. Please have them ready locally.
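
The callout note added in this commit says the vLLM instance must take the last N GPUs, leaving the first ones for training. As a rough illustration of that rule, the split can be computed with a small Python helper. This is only a sketch: `split_gpus` is a hypothetical name that is not part of axolotl or TRL, and it assumes the GPUs are indexed contiguously from 0.

```python
def split_gpus(total_gpus: int, vllm_gpus: int) -> tuple[str, str]:
    """Return (training, vllm) values for CUDA_VISIBLE_DEVICES.

    Hypothetical helper, not an axolotl or TRL API. It assumes GPUs are
    numbered 0..total_gpus-1 and gives the LAST `vllm_gpus` devices to vLLM.
    """
    if not 0 < vllm_gpus < total_gpus:
        raise ValueError("vllm_gpus must be between 1 and total_gpus - 1")
    ids = [str(i) for i in range(total_gpus)]
    split = total_gpus - vllm_gpus
    # First GPUs go to training, last GPUs go to the vLLM instance.
    return ",".join(ids[:split]), ",".join(ids[split:])


train_devices, vllm_devices = split_gpus(total_gpus=4, vllm_gpus=2)
print(f"training: CUDA_VISIBLE_DEVICES={train_devices}")
print(f"vLLM:     CUDA_VISIBLE_DEVICES={vllm_devices}")
```

With 4 GPUs and 2 reserved for vLLM, this reproduces the assignment used in the docs: `0,1` for training and `2,3` for the vLLM instance.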
