CUDA OUT OF MEMORY!!!!!!!!!!!!!!!!!! WHEN GRPO..... HELP!!! #6393

@ragelgq

Description

#!/bin/bash
#SBATCH -J rlhf_grpo
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p a01
#SBATCH --gres=gpu:4            # request 4 GPUs (A800)
#SBATCH --cpus-per-task=32
#SBATCH --time=96:00:00
#SBATCH --mem=0

# export these so the swift process actually inherits them
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export NPROC_PER_NODE=4
export CUDA_VISIBLE_DEVICES=0,1,2,3
swift rlhf \
    --rlhf_type grpo \
    --model /WORK/ \
    --external_plugins / \
    --reward_funcs reward \
    --reward_weights 1.0 \
    --use_vllm true \
    --vllm_mode server \
    --vllm_server_host gxx \
    --vllm_server_port xxxx \
    --train_type lora \
    --torch_dtype bfloat16 \
    --lora_rank 16 \
    --lora_alpha 32 \
    --dataset /WORK/PUBLIC \
    --load_from_cache_file true \
    --max_length 3072 \
    --max_completion_length 3072 \
    --num_train_epochs 2 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --gradient_accumulation_steps 4 \
    --eval_steps 1 \
    --save_steps 1 \
    --save_total_limit 30 \
    --logging_steps 1 \
    --output_dir /WORK/PUBLIC/ \
    --num_generations 8 \
    --temperature 0.9 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --dataset_num_proc 4 \
    --deepspeed zero3 \
    --num_iterations 1 \
    --beta 0.04

echo "$(date) === swift rlhf job finished ==="
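As a sanity check on the batch arithmetic implied by the flags above, here is a short sketch (it assumes a GRPO-style trainer where the global completion batch per optimizer step must be divisible by `num_generations`; the constants are just this script's values):

```python
# Sketch: GRPO batch arithmetic for the Slurm script above.
# Values copied from the script; adjust if you change the flags.
per_device_train_batch_size = 2   # --per_device_train_batch_size
num_gpus = 4                      # --gres=gpu:4 / NPROC_PER_NODE
gradient_accumulation_steps = 4   # --gradient_accumulation_steps
num_generations = 8               # --num_generations

# Completions processed per optimizer step across all ranks.
completions_per_step = (per_device_train_batch_size
                        * num_gpus
                        * gradient_accumulation_steps)

# GRPO groups completions by prompt, so the global completion batch
# must divide evenly by num_generations.
assert completions_per_step % num_generations == 0
prompts_per_step = completions_per_step // num_generations
print(completions_per_step, prompts_per_step)  # → 32 4
```

So each optimizer step trains on 32 completions covering 4 distinct prompts; if you shrink the per-device batch to save memory, keep the product divisible by 8.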

Above is my Slurm script. My max_length and max_completion_length cannot be truncated any further; I have already changed them countless times, and ZeRO-3 is enabled, yet it still reports that about 250 MB more needs to be allocated and there is not enough memory. What can I do? Can the RLHF stage enable TP/DP the way Megatron does? Thanks in advance for sharing any solutions.
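When the shortfall is only a few hundred MB, a few config deltas usually free enough memory. The fragment below is a hedged sketch, not a verified fix: flag names such as `--gradient_checkpointing` and the `zero3_offload` DeepSpeed preset exist in recent ms-swift releases, but check them against `swift rlhf --help` on your installed version before relying on them.

```shell
# Hedged sketch: memory-saving deltas for the script above.
# Append the rest of the original flags after these; verify each
# flag name against your installed ms-swift version.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True  # reduce fragmentation

swift rlhf \
    --rlhf_type grpo \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --gradient_checkpointing true \
    --deepspeed zero3_offload
```

Halving the per-device batch while doubling gradient accumulation keeps the global completion batch at 32, so it still divides evenly by `num_generations 8`; gradient checkpointing trades recompute time for activation memory, and `zero3_offload` moves optimizer state to CPU.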
