Description
System Info
----------Python Info----------
Version : 3.10.12
Compiler : GCC 11.4.0
Build : ('main', 'Jul 29 2024 16:56:48')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 25.1.1
vllm : 0.10.0
ray : 2.47.1
torch : 2.7.1+cu126
----------verl Info-----------
Version : 0.7.0.dev
----------Platform Info----------
Platform : Linux-5.10.0-34-amd64-x86_64-with-glibc2.35
system : Linux
node : debian
release : 5.10.0-34-amd64
version : #1 SMP Debian 5.10.234-1 (2025-02-24)
----------Environment----------
VERL_LOGGING_LEVEL=''
CUDA Runtime : 12.6
CUDA Compiler : Cuda compilation tools, release 12.6, V12.6.20
----------System Info----------
CPU Memory : 187 GB
GPU Count : 2
GPU 1 Type : NVIDIA A100 80GB PCIe
GPU 1 Memory : 81920 MiB
GPU 2 Type : NVIDIA A100 80GB PCIe
GPU 2 Memory : 81920 MiB
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
MODEL=${HOME}/verl/model/Qwen2.5-0.5B-Instruct
TRAIN_SET=${HOME}/verl/data/gsm8k/train_ID.parquet
VAL_SET=${HOME}/verl/data/gsm8k/test.parquet
ACTOR_DP=2
ACTOR_TP=1
python -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    trainer.val_before_train=False \
    data.train_files=${TRAIN_SET} \
    data.val_files=${VAL_SET} \
    data.train_batch_size=256 \
    data.max_prompt_length=512 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    data.shuffle=True \
    actor_rollout_ref.model.path=${MODEL} \
    actor_rollout_ref.model.lora_rank=64 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.actor.optim.lr=5e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=64 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=64 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=${ACTOR_TP} \
    actor_rollout_ref.rollout.data_parallel_size=${ACTOR_DP} \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.75 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=64 \
    actor_rollout_ref.ref.fsdp_config.param_offload=False \
    actor_rollout_ref.rollout.mode=sync \
    trainer.logger='["console", "wandb"]' \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    +ray_kwargs.ray_init.log_to_driver=true \
    trainer.project_name='veRL' \
    trainer.experiment_name='exp_name' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=200 \
    trainer.test_freq=10 \
    trainer.total_epochs=3
Expected behavior
I noticed that the infer_tp size defined in verl/verl/workers/fsdp_workers.py (line 589 in 1ae510c):

infer_tp = self.config.rollout.tensor_model_parallel_size * self.config.rollout.data_parallel_size

may be set incorrectly.
I conducted experiments comparing the original implementation against a modified version with infer_tp = self.config.rollout.tensor_model_parallel_size. The results show that the original implementation suffers from significant efficiency degradation.
In addition, I printed the prompts received by each DP worker and found that, under the original setting, prompts were not sharded across the DP workers at all: each worker received the full batch.
The results are shown below.
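As a minimal sketch (a hypothetical helper, not verl's actual dispatch code) of why the original setting keeps prompts from being sharded: each rollout engine spans infer_tp ranks, so the number of data-parallel groups that prompts are split across is world_size // infer_tp. With world_size=2, tp=1, dp=2, the original infer_tp = tp * dp yields a single group, while infer_tp = tp yields two.

```python
# Hypothetical illustration of how infer_tp determines the number of
# data-parallel rollout groups that prompts are sharded across.
def dp_group_count(world_size: int, infer_tp: int) -> int:
    # Each rollout engine spans infer_tp ranks; prompts are split
    # across the remaining groups.
    assert world_size % infer_tp == 0
    return world_size // infer_tp

world_size, tp, dp = 2, 1, 2  # values from the run in this report

# Original: infer_tp = tp * dp -> one group, so every rank sees all
# 256 prompts (consistent with the "origin" logs).
print(dp_group_count(world_size, tp * dp))  # -> 1

# Modified: infer_tp = tp -> two groups, so each rank receives
# 128 prompts (consistent with the modified logs).
print(dp_group_count(world_size, tp))  # -> 2
```

This matches the per-worker prompt counts in the logs: 256 prompts on both ranks originally, 128 per rank after the change.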
origin
(WorkerDict pid=455057) [Worker Info] Global Rank: 0, World Size: 2
(WorkerDict pid=455057) [Prompts Info] Number of prompts (before repeat): 256
(WorkerDict pid=455057) [Repeat Info] repeat_times: 5, repeat_interleave: True
(WorkerDict pid=455058) [Worker Info] Global Rank: 1, World Size: 2
(WorkerDict pid=455058) [Prompts Info] Number of prompts (before repeat): 256
(WorkerDict pid=455058) [Repeat Info] repeat_times: 5, repeat_interleave: True
(WorkerDict pid=455058) [Prompts Info] Number of prompts (after repeat): 1280
- response_length_non_aborted/mean: 304.8968811035156
- response_length_non_aborted/max: 4096.0
- response_length_non_aborted/min: 129.0
- response_length_non_aborted/clip_ratio: 0.0007812500116415322
- response/aborted_ratio: 0.0
- prompt_length/mean: 104.4921875
- prompt_length/max: 183.0
- prompt_length/min: 69.0
- prompt_length/clip_ratio: 0.0
- timing_s/start_profile: 0.0005315960152074695
- timing_s/generate_sequences: 76.94253540039062
- timing_s/generation_timing/max: 93.05921936035156
- timing_s/generation_timing/min: 60.82584762573242
- timing_s/generation_timing/topk_ratio: 0.5
- timing_s/gen: 99.35007062903605
- timing_s/reward: 0.3626377400942147
- timing_s/old_log_prob: 8.188723739935085
- timing_s/ref: 4.42419486597646
- timing_s/adv: 0.04897050198633224
- timing_s/update_actor: 26.710124182980508
- timing_s/step: 139.3436508589657
- timing_s/stop_profile: 0.0001778359292075038
- timing_per_token_ms/gen: 0.2545688363612596
- timing_per_token_ms/adv: 9.34519462811053e-05
- timing_per_token_ms/ref: 0.008442829952361293
- timing_per_token_ms/update_actor: 0.0509717684945565
- perf/total_num_tokens: 524018
- perf/time_per_step: 139.3436508589657
- perf/throughput: 1880.308133057228
infer_tp = self.config.rollout.tensor_model_parallel_size
(WorkerDict pid=344971) [Worker Info] Global Rank: 0, World Size: 2
(WorkerDict pid=344971) [Prompts Info] Number of prompts (before repeat): 128
(WorkerDict pid=344971) [Repeat Info] repeat_times: 5, repeat_interleave: True
(WorkerDict pid=344971) [Prompts Info] Number of prompts (after repeat): 640
(WorkerDict pid=344972) [Worker Info] Global Rank: 1, World Size: 2
(WorkerDict pid=344972) [Prompts Info] Number of prompts (before repeat): 128
(WorkerDict pid=344972) [Repeat Info] repeat_times: 5, repeat_interleave: True
(WorkerDict pid=344972) [Prompts Info] Number of prompts (after repeat): 640
- response_length_non_aborted/mean: 292.1937561035156
- response_length_non_aborted/max: 4096.0
- response_length_non_aborted/min: 111.0
- response_length_non_aborted/clip_ratio: 0.0023437500931322575
- response/aborted_ratio: 0.0
- prompt_length/mean: 101.9140625
- prompt_length/max: 199.0
- prompt_length/min: 70.0
- prompt_length/clip_ratio: 0.0
- timing_s/start_profile: 6.126496009528637e-05
- timing_s/generate_sequences: 42.80970001220703
- timing_s/generation_timing/max: 43.4930305480957
- timing_s/generation_timing/min: 42.12636947631836
- timing_s/generation_timing/topk_ratio: 0.5
- timing_s/gen: 48.79045281698927
- timing_s/reward: 0.4484845600090921
- timing_s/old_log_prob: 5.699306814931333
- timing_s/ref: 4.0395765529247
- timing_s/adv: 0.19046411104500294
- timing_s/update_actor: 20.035540578886867
- timing_s/step: 79.389909783029
- timing_s/stop_profile: 0.0001710229553282261
- timing_per_token_ms/adv: 0.00037756188036467444
- timing_per_token_ms/update_actor: 0.03971696470050404
- timing_per_token_ms/gen: 0.13045296575738827
- timing_per_token_ms/ref: 0.008007755953765626
- perf/total_num_tokens: 504458
- perf/time_per_step: 79.389909783029
- perf/throughput: 3177.0914048061863
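To quantify the efficiency gap, the logged metrics from the two runs above can be compared directly:

```python
# Efficiency comparison using the perf/throughput and timing_s/gen
# values logged in the two runs above.
orig_throughput = 1880.308133057228    # tokens/s, original infer_tp = tp * dp
fixed_throughput = 3177.0914048061863  # tokens/s, infer_tp = tensor_model_parallel_size
orig_gen = 99.35007062903605           # timing_s/gen, original
fixed_gen = 48.79045281698927          # timing_s/gen, modified

print(f"throughput speedup:    {fixed_throughput / orig_throughput:.2f}x")  # ~1.69x
print(f"generation time ratio: {orig_gen / fixed_gen:.2f}x")                # ~2.04x
```

So the one-line change roughly doubles generation speed and improves end-to-end throughput by about 1.7x on this 2-GPU setup.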