A simple env for training on H100 nodes #34
Open
Hi @jonhue, nice work and thanks for open-sourcing the code!
I also ran into package conflicts (on H100) when following the instructions. Judging from #30, verl:vllm017.latest seems to work for others. However, the assertion error below still occurs on my side, possibly due to a mismatch between the infra in SDPO and vllm017:
AssertionError: local_world_size (2) must be less than or equal to the number of visible devices (1).
In my case, simply using docker pull verlai/verl:vllm012.latest makes training work on H100. Hope this helps folks using H100s :)
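For anyone debugging the same assertion, here is a minimal sketch of a pre-flight check that mirrors the condition in the error message (local world size must not exceed the number of visible devices). It only inspects the `CUDA_VISIBLE_DEVICES` environment variable; the function names are hypothetical, and this is not the actual check performed inside vLLM or verl, nor necessarily the cause of the mismatch in my setup.

```python
import os


def visible_device_count(env=None):
    """Count devices listed in CUDA_VISIBLE_DEVICES.

    Returns None when the variable is unset, since in that case all GPUs
    on the node are visible and the count cannot be derived from the
    environment alone.
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    # Entries are comma-separated; ignore empty entries.
    return len([d for d in value.split(",") if d.strip()])


def check_local_world_size(local_world_size, env=None):
    """Return True if local_world_size fits within the visible devices.

    Hypothetical helper: reproduces the inequality from the error message,
    treating an unset CUDA_VISIBLE_DEVICES as "cannot tell, assume OK".
    """
    count = visible_device_count(env)
    if count is None:
        return True
    return local_world_size <= count


if __name__ == "__main__":
    # Example: two workers requested but only one device exposed.
    print(check_local_world_size(2, {"CUDA_VISIBLE_DEVICES": "0"}))
```

Running something like this inside the container before launching training can show whether the process only sees a single device, which is the situation the assertion complains about.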