During training, the amount of GPU memory required fluctuates constantly. For example, when I use DAPO to train Qwen2.5-Math-7B on four A100 GPUs, each card needs nearly 60 GB of GPU memory at the peak, but less than 1 GB at the low point. Since I am currently sharing an 8-card A100 machine with several other people, the freed GPU memory is often claimed by other processes immediately after it is released, which then causes an Out Of Memory (OOM) error when training needs it back. I would therefore like a way to pin (reserve) the GPU memory for the duration of training.
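In case it helps, a common workaround (not a feature of any particular training framework) is to hold the memory yourself with a placeholder tensor, deleting it just before a known memory peak and re-allocating it afterwards. The sketch below assumes a PyTorch setup; `reserve_bytes` and the helper names are illustrative, not existing DAPO options:

```python
"""Sketch of pinning GPU memory on a shared machine by holding a
placeholder tensor. Assumption: PyTorch training process; names here
(reserve, placeholder_numel) are hypothetical helpers, not framework APIs."""

GiB = 1024 ** 3


def placeholder_numel(reserve_bytes: int, elem_size: int = 1) -> int:
    """Number of elements needed for a placeholder of ~reserve_bytes bytes
    (uint8 elements are 1 byte each, so the mapping is direct)."""
    return reserve_bytes // elem_size


def reserve(reserve_bytes: int, device: str = "cuda:0"):
    """Allocate a uint8 tensor occupying ~reserve_bytes on `device`.

    Keep a reference to the returned tensor so the memory stays claimed;
    `del` it (and call torch.cuda.empty_cache() only if you really want to
    return memory to the driver) right before a peak phase, then re-reserve.
    """
    import torch  # local import so the sizing helper is usable without torch

    return torch.empty(
        placeholder_numel(reserve_bytes), dtype=torch.uint8, device=device
    )
```

Note that PyTorch's caching allocator already retains freed memory inside the process rather than returning it to the driver, so memory only becomes visible to other users' processes when something explicitly releases it (e.g. a `torch.cuda.empty_cache()` call, or an inference engine's sleep/offload mode); if the framework exposes a switch for that release step, disabling it may achieve the same pinning effect without a placeholder.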