
Commit 81a15ed

revert: "Use Mcore GPTModel" (verl-project#883)
Reverts verl-project#706 temporarily as it breaks CI https://github.com/volcengine/verl/actions/runs/14220739954/attempts/2 ``` (TaskRunner pid=10086) 'Initial validation metrics: {}' (TaskRunner pid=10086) step:0 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (TaskRunner pid=10086) validation generation end (TaskRunner pid=10086) [prompt] You are an AI programming assistant, utilizing the Deepseek Coder model, developed by Deepseek Company, and you only answer questions related to computer science. For politically sensitive questions, security and privacy issues, and other non-computer science questions, you will refuse to answer (TaskRunner pid=10086) ### Instruction: (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [02:39<05:18, 159.11s/it] (WorkerDict pid=18977) /root/miniconda3/lib/python3.10/site-packages/torch/autograd/graph.py:768: UserWarning: c10d::broadcast_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at ../torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.) [repeated 7x across cluster] (WorkerDict pid=18977) return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass [repeated 7x across cluster] (TaskRunner pid=10086) Training Progress: 33%|███▎ | 1/3 [04:51<09:43, 291.93s/it] (WorkerDict pid=18980) [rank4]:[E402 16:49:38.988158820 ProcessGroupNCCL.cpp:1515] [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame verl-project#8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame verl-project#9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,666 E 18980 20767] logging.cc:97: Unhandled exception: N3c1016DistBackendErrorE. what(): [PG 97 Rank 0] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:43 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fc6e4126d10 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fc6e4594f08 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10_cuda.so) (WorkerDict pid=18980) frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x56 (0x7fc6927d2a56 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0xa0 (0x7fc6927d7c70 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#5: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1da (0x7fc6927de92a in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc6927e0d6c in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame verl-project#7: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame verl-project#8: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame verl-project#9: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) Exception raised from ncclCommWatchdog at ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first): (WorkerDict pid=18980) frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fc6e4177f86 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libc10.so) (WorkerDict pid=18980) frame #1: <unknown function> + 0xe1a5e4 (0x7fc6924625e4 in /root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so) (WorkerDict pid=18980) frame #2: <unknown function> + 0xdbbf4 (0x7fc9fd477bf4 in /root/miniconda3/bin/../lib/libstdc++.so.6) (WorkerDict pid=18980) frame #3: <unknown function> + 0x94ac3 (0x7fc9ff2f0ac3 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) frame verl-project#4: clone + 0x44 (0x7fc9ff381a04 in /usr/lib/x86_64-linux-gnu/libc.so.6) (WorkerDict pid=18980) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:104: Stack trace: (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe543a) [0x7fc9fe5a143a] ray::operator<<() (WorkerDict pid=18980) /root/miniconda3/lib/python3.10/site-packages/ray/_raylet.so(+0xfe7b78) [0x7fc9fe5a3b78] ray::TerminateHandler() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc9fd44d35a] __cxxabiv1::__terminate() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc9fd44d3c5] (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc9fd44d34f] (WorkerDict pid=18980) 
/root/miniconda3/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xe1a695) [0x7fc692462695] c10d::ProcessGroupNCCL::ncclCommWatchdog() (WorkerDict pid=18980) /root/miniconda3/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc9fd477bf4] execute_native_thread_routine (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x7fc9ff2f0ac3] (WorkerDict pid=18980) /usr/lib/x86_64-linux-gnu/libc.so.6(clone+0x44) [0x7fc9ff381a04] __clone (WorkerDict pid=18980) (WorkerDict pid=18980) *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: *** SIGABRT received at time=1743612578 on cpu 118 *** (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: PC: @ 0x7fc9ff2f29fc (unknown) pthread_kill (WorkerDict pid=18980) [2025-04-02 16:49:38,675 E 18980 20767] logging.cc:361: @ 0x7fc9ff29e520 (unknown) (unknown) (WorkerDict pid=18980) Fatal Python error: Aborted (WorkerDict pid=18980) (WorkerDict pid=18980) (WorkerDict pid=18980) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, _brotli, zstandard.backend_c, uvloop.loop, ray._raylet, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pyarrow.lib, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pyarrow._compute, pandas._libs.ops, pandas._libs.hashing, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.internals, pandas._libs.indexing, pandas._libs.index, pandas._libs.writers, pandas._libs.join, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, markupsafe._speedups, PIL._imaging, msgspec._core, sentencepiece._sentencepiece, PIL._imagingft, regex._regex, multidict._multidict, yarl._helpers_c, yarl._quoting_c, aiohttp._helpers, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket, frozenlist._frozenlist, pyarrow._json, zmq.backend.cython.context, zmq.backend.cython.message, zmq.backend.cython.socket, zmq.backend.cython._device, zmq.backend.cython._poll, zmq.backend.cython._proxy_steerable, zmq.backend.cython._version, zmq.backend.cython.error, zmq.backend.cython.utils (total: 96) Error executing job with overrides: ['algorithm.adv_estimator=gae', 'data.train_files=/github/home/data/gsm8k/train.parquet', 
'data.val_files=/github/home/data/gsm8k/test.parquet', 'data.train_batch_size=1024', 'data.max_prompt_length=512', 'data.max_response_length=512', 'actor_rollout_ref.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'actor_rollout_ref.actor.optim.lr=2e-6', 'actor_rollout_ref.actor.ppo_mini_batch_size=256', 'actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4', 'actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.actor.megatron.tensor_model_parallel_size=4', 'actor_rollout_ref.actor.use_kl_loss=False', 'actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8', 'actor_rollout_ref.rollout.tensor_model_parallel_size=2', 'actor_rollout_ref.rollout.name=vllm', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.5', 'actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16', 'actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.virtual_pipeline_model_parallel_size=2', 'actor_rollout_ref.ref.megatron.tensor_model_parallel_size=2', 'critic.optim.lr=2e-5', 'critic.model.path=/github/home/models/deepseek-ai/deepseek-coder-1.3b-instruct', 'critic.model.enable_gradient_checkpointing=False', 'critic.ppo_micro_batch_size_per_gpu=4', 'critic.megatron.pipeline_model_parallel_size=2', 'critic.megatron.virtual_pipeline_model_parallel_size=2', 'critic.megatron.tensor_model_parallel_size=2', 'algorithm.use_kl_in_reward=True', 'algorithm.kl_penalty=kl', 'algorithm.kl_ctrl.kl_coef=0.001', 'trainer.critic_warmup=0', 'trainer.logger=[console]', 'trainer.project_name=verl_megatron_gsm8k_examples', 'trainer.experiment_name=deepseek_llm_1b3_function_rm', 'trainer.n_gpus_per_node=8', 'trainer.nnodes=1', 'trainer.save_freq=-1', 'trainer.test_freq=1', 'trainer.total_epochs=15', 'trainer.total_training_steps=3'] (TaskRunner pid=10086) Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market? Let's think step by step and output the final answer after "####". (TaskRunner pid=10086) ### Response: (TaskRunner pid=10086) (TaskRunner pid=10086) [response] I'm sorry, but as an AI programming assistant, I'm specialized in answering questions related to computer science. I'm not equipped to provide answers to questions about economics or business calculations. I recommend using a calculator or a business-oriented tool for this type of question. 
(TaskRunner pid=10086) (TaskRunner pid=10086) [ground_truth] 18 (TaskRunner pid=10086) [score] 0.0 (TaskRunner pid=10086) step:1 - global_seqlen/min:[486](https://github.com/volcengine/verl/actions/runs/14220739954/job/39861249946#step:6:487)35.000 - global_seqlen/max:51694.000 - global_seqlen/minmax_diff:3059.000 - global_seqlen/balanced_min:49636.000 - global_seqlen/balanced_max:49637.000 - global_seqlen/mean:49636.125 - actor/reward_kl_penalty:0.000 - actor/reward_kl_penalty_coeff:0.001 - critic/vf_loss:0.015 - critic/vf_clipfrac:0.001 - critic/vpred_mean:0.007 - perf/mfu/critic:0.105 - actor/entropy_loss:0.550 - actor/pg_loss:-0.000 - actor/pg_clipfrac:0.018 - actor/ppo_kl:0.000 - actor/pg_clipfrac_lower:0.000 - perf/mfu/actor:0.106 - critic/score/mean:0.000 - critic/score/max:0.000 - critic/score/min:0.000 - critic/rewards/mean:0.000 - critic/rewards/max:0.000 - critic/rewards/min:0.000 - critic/advantages/mean:-0.000 - critic/advantages/max:4.994 - critic/advantages/min:-5.666 - critic/returns/mean:-0.000 - critic/returns/max:0.000 - critic/returns/min:-0.000 - critic/values/mean:-0.164 - critic/values/max:0.785 - critic/values/min:-1.000 - critic/vf_explained_var:-2803.085 - response_length/mean:239.112 - response_length/max:512.000 - response_length/min:11.000 - response_length/clip_ratio:0.029 - prompt_length/mean:148.670 - prompt_length/max:275.000 - prompt_length/min:106.000 - prompt_length/clip_ratio:0.000 - timing_s/gen:18.608 - timing_s/old_log_prob:15.249 - timing_s/ref:14.[488](https://github.com/volcengine/verl/actions/runs/14220739954/job/39861249946#step:6:489) - timing_s/values:16.315 - timing_s/adv:0.264 - timing_s/update_critic:33.651 - timing_s/update_actor:33.472 - timing_s/testing:25.497 - timing_s/step:157.587 - timing_per_token_ms/adv:0.001 - timing_per_token_ms/gen:0.076 - timing_per_token_ms/update_actor:0.084 - timing_per_token_ms/values:0.041 - timing_per_token_ms/update_critic:0.085 - timing_per_token_ms/ref:0.036 - perf/total_num_tokens:397089.000 - perf/time_per_step:157.587 - perf/throughput:314.976 (TaskRunner pid=10086) list(reward_extra_infos_dict.keys())=[] (TaskRunner pid=10086) test_gen_batch meta info: {'eos_token_id': 32021, 'pad_token_id': 32014, 'recompute_log_prob': False, 'do_sample': False, 'validate': True} (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Failed to pickle inputs of failed execution: CUDA error: an illegal memory access was encountered (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] For debugging consider passing CUDA_LAUNCH_BLOCKING=1 (WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. 
(WorkerDict pid=18980) WARNING 04-02 16:49:38 model_runner_base.py:143] Traceback (most recent call last): File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 54, in main run_ppo(config) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 72, in run_ppo ray.get(runner.run.remote(config)) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper return fn(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper return func(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 2667, in get values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout) File "/root/miniconda3/lib/python3.10/site-packages/ray/_private/worker.py", line 864, in get_objects raise value.as_instanceof_cause() ray.exceptions.RayTaskError(RuntimeError): ray::TaskRunner.run() (pid=10086, ip=172.20.0.2, actor_id=11bc451866f5759f3a7f540[501](https://github.com/volcengine/verl/actions/runs/14220739954/job/39861249946#step:6:502)000000, repr=<main_ppo.TaskRunner object at 0x7fd00c61a110>) File "/data00/tiger/huggingface/verl/verl/verl/trainer/main_ppo.py", line 184, in run trainer.fit() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 950, in fit val_metrics: dict = self._validate() File "/data00/tiger/huggingface/verl/verl/verl/trainer/ppo/ray_trainer.py", line 545, in _validate test_output_gen_batch_padded = self.actor_rollout_wg.generate_sequences(test_gen_batch_padded) File "/data00/tiger/huggingface/verl/verl/verl/single_controller/ray/base.py", line 42, in func output = ray.get(output) ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_generate_sequences() (pid=18980, ip=172.20.0.2, actor_id=4f21075809bd462a5907ebea01000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7fc62ae1ce20>) File "/root/miniconda3/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1708, in execute_model output: SamplerOutput = self.model.sample( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 571, in sample next_tokens = self.sampler(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl return forward_call(*args, **kwargs) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 231, in forward self._init_sampling_tensors(logits, sampling_metadata) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/sampler.py", line 195, in _init_sampling_tensors do_min_p) = SamplingTensors.from_sampling_metadata( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 471, in from_sampling_metadata sampling_tensors = SamplingTensors.from_lists( File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/sampling_metadata.py", line 529, in from_lists temperatures_t = torch.tensor( RuntimeError: CUDA error: an illegal memory access was encountered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. 
For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. ```
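The failure above is an asynchronous CUDA error, so the stack trace points at an unrelated call. A minimal sketch of the debugging step the log itself suggests (not part of this commit): force synchronous kernel launches before CUDA is initialized. `TORCH_USE_CUDA_DSA`, by contrast, is a build-time option and cannot be switched on from a script.

```python
# Debugging sketch only: make CUDA kernel launches synchronous so the illegal
# memory access is reported at the faulting call site. Set the variable before
# torch initializes CUDA; expect a significant slowdown.
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  (imported after the environment variable is set)
```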
1 parent 233c11c commit 81a15ed

File tree: 15 files changed (+296, -1494 lines)

scripts/model_merger.py

Lines changed: 139 additions & 135 deletions
@@ -238,129 +238,156 @@ def process_one_shard(shard_dir):
     if args.test:
         ref_state_dict = load_file(os.path.join(args.test_hf_dir, 'model.safetensors'))

-    def merge_across_tp(key, tp_data):
-        if "linear_fc1.weight" in key:
-            # if the tensor is gate and proj
-            gate_lst = []
-            up_lst = []
-            for infer_param in tp_data:
-                gate, up = infer_param.chunk(2)
-                gate_lst.append(gate)
-                up_lst.append(up)
-            gate = torch.cat(gate_lst, dim=0)
-            up = torch.cat(up_lst, dim=0)
-            tp_data = [gate, up]
-        elif "self_attention.linear_qkv." in key and 'layer_norm' not in key:
-            # if the tensor is qkv, for each param on tp, split into q, k, v
-            # concat q, k, v separately.
-            q_lst = []
-            k_lst = []
-            v_lst = []
-            assert config.num_attention_heads % config.num_key_value_heads == 0
-            num_q_per_kv = config.num_attention_heads // config.num_key_value_heads
-            assert tp_data[0].shape[0] % (num_q_per_kv + 2) == 0
-            kv_size_per_tp = tp_data[0].shape[0] // (num_q_per_kv + 2)
-            split_size = [kv_size_per_tp * num_q_per_kv, kv_size_per_tp, kv_size_per_tp]
-            for infer_param in tp_data:
-                num_query_groups_per_partition = config.num_key_value_heads // tp_size
-                for chunk in infer_param.chunk(num_query_groups_per_partition):
-                    split_size = [
-                        kv_size_per_tp * num_q_per_kv // num_query_groups_per_partition,
-                        kv_size_per_tp // num_query_groups_per_partition,
-                        kv_size_per_tp // num_query_groups_per_partition
-                    ]
-                    q, k, v = chunk.split(split_size)
-                    q_lst.append(q)
-                    k_lst.append(k)
-                    v_lst.append(v)
-            q = torch.cat(q_lst, dim=0)
-            k = torch.cat(k_lst, dim=0)
-            v = torch.cat(v_lst, dim=0)
-
-            tp_data = [q,k,v]
-
-        elif "layer_norm" in key or "layernorm" in key or "output_layer" in key and args.is_value_model:
-            tp_data = tp_data[0]
+    def handle_qkv_proj(key, config, tensor, state_dict):
+        nonlocal tp_size
+
+        hidden_size_per_head = config.hidden_size // config.num_attention_heads
+
+        if config.num_key_value_heads >= tp_size:
+            q_size_tp = config.hidden_size // tp_size
+            kv_size_tp = hidden_size_per_head * config.num_key_value_heads // tp_size
+            total_size = q_size_tp + 2 * kv_size_tp
+            q_part = tensor[:q_size_tp]
+            k_part = tensor[q_size_tp:q_size_tp + kv_size_tp]
+            v_part = tensor[q_size_tp + kv_size_tp:total_size]
+        else:
+            q_size_tp = config.hidden_size // tp_size
+            kv_size_tp = hidden_size_per_head
+            total_size = q_size_tp + 2 * kv_size_tp
+            q_part = tensor[:q_size_tp]
+            k_part = tensor[q_size_tp:q_size_tp + kv_size_tp]
+            v_part = tensor[q_size_tp + kv_size_tp:total_size]
+
+        preffix = '.'.join(key.split('.')[:4])
+        suffix = '.'.join(key.split('.')[5:])
+        if state_dict.get(f'{preffix}.q_proj.{suffix}') is None:
+            state_dict[f'{preffix}.q_proj.{suffix}'] = q_part
         else:
-            dim = 0
-            if "linear_fc2.weight" in key or "self_attention.linear_proj" in key:
-                dim = 1
-            tp_data = torch.cat(tp_data, dim=dim)
+            state_dict[f'{preffix}.q_proj.{suffix}'] = torch.concat([state_dict[f'{preffix}.q_proj.{suffix}'], q_part], dim=0)
+        if state_dict.get(f'{preffix}.k_proj.{suffix}') is None:
+            state_dict[f'{preffix}.k_proj.{suffix}'] = k_part
+        else:
+            state_dict[f'{preffix}.k_proj.{suffix}'] = torch.concat([state_dict[f'{preffix}.k_proj.{suffix}'], k_part], dim=0)
+        if state_dict.get(f'{preffix}.v_proj.{suffix}') is None:
+            state_dict[f'{preffix}.v_proj.{suffix}'] = v_part
+        else:
+            state_dict[f'{preffix}.v_proj.{suffix}'] = torch.concat([state_dict[f'{preffix}.v_proj.{suffix}'], v_part], dim=0)

+        return state_dict

-        return tp_data
+    def handle_gate_up_proj(key, config, tensor, state_dict):
+        nonlocal tp_size
+
+        intermediate_size_tp = config.intermediate_size // tp_size
+        gate_weight_tp = tensor[:intermediate_size_tp]
+        up_weight_tp = tensor[intermediate_size_tp:]
+        preffix = '.'.join(key.split('.')[:4])
+        suffix = '.'.join(key.split('.')[5:])
+        if state_dict.get(f'{preffix}.gate_proj.{suffix}') is None:
+            state_dict[f'{preffix}.gate_proj.{suffix}'] = gate_weight_tp
+        else:
+            state_dict[f'{preffix}.gate_proj.{suffix}'] = torch.concat([state_dict[f'{preffix}.gate_proj.{suffix}'], gate_weight_tp], dim=0)
+        if state_dict.get(f'{preffix}.up_proj.{suffix}') is None:
+            state_dict[f'{preffix}.up_proj.{suffix}'] = up_weight_tp
+        else:
+            state_dict[f'{preffix}.up_proj.{suffix}'] = torch.concat([state_dict[f'{preffix}.up_proj.{suffix}'], up_weight_tp], dim=0)
+
+        return state_dict
+
+    def merge_between_tp_rank(key, model_state_dict):
+        nonlocal state_dict
+
+        try:
+            tensor = model_state_dict.pop(key)
+        except:
+            raise RuntimeError(f"error pop: {key}")
+        # Embedding layer
+        if "model.embed_tokens.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            else:
+                state_dict[key] = torch.concat([state_dict[key], tensor], dim=0)
+            return state_dict
+        # Tranformer Layers
+        if "input_layernorm.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            return state_dict
+        if re.search(r"self_attn\.qkv_proj", key):
+            state_dict = handle_qkv_proj(key, config, tensor, state_dict)
+            return state_dict
+        if "self_attn.o_proj.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            else:
+                state_dict[key] = torch.concat([state_dict[key], tensor], dim=1)
+            return state_dict
+        if "post_attention_layernorm.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            return state_dict
+        if re.search(r"mlp\.gate_up_proj\.weight", key):
+            state_dict = handle_gate_up_proj(key, config, tensor, state_dict)
+            return state_dict
+        if "mlp.down_proj.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            else:
+                state_dict[key] = torch.concat([state_dict[key], tensor], dim=1)
+            return state_dict
+        # Final LayerNorm
+        if "model.norm.weight" in key:
+            if state_dict[key] is None:
+                state_dict[key] = tensor
+            return state_dict
+        if not args.tie_word_embedding:
+            if args.is_value_model:
+                if "lm_head.weight" in key:
+                    if state_dict[key] is None:
+                        state_dict[key] = tensor
+                if "reward_head.weight" in key:
+                    if state_dict[key] is None:
+                        state_dict[key] = tensor
+            else:
+                if "lm_head.weight" in key:
+                    if state_dict[key] is None:
+                        state_dict[key] = tensor
+                    else:
+                        state_dict[key] = torch.concat([state_dict[key], tensor], dim=0)
+            return state_dict
+        return state_dict

-    vpp_size = len(model_state_dict_lst[0][0])
-    layers_cum = 0
-    for vpp_rank in range(vpp_size):
-        for pp_rank in range(pp_size):
-            layers_handled = 0
-            keys = model_state_dict_lst[pp_rank][0][vpp_rank].keys()
+    for pp_rank in range(pp_size):
+        print(f'pp_rank: {pp_rank}')
+        for vpp_rank, state_dict_single_layer in enumerate(model_state_dict_lst[pp_rank][0]):
+            state_dict_single_layer_iter = state_dict_single_layer.copy()
+            keys = state_dict_single_layer_iter.keys()
             for key in keys:
                 if "extra_state" in key:
                     continue
-                if args.tie_word_embedding and ("output_layer" in key):
+                if args.tie_word_embedding and ("lm_head" in key or "reward_head" in key):
                     print(f'skip lm_head and reward_head loading because of tie_word_embeddings')
                     continue
-                new_key = key
-                if "decoder.layers." in key:
-                    local_layer_no = int(key.split('.')[2])
-                    layers_handled = max(local_layer_no, layers_handled)
-                    global_layer_no = local_layer_no + layers_cum
-                    new_key_list=key.split('.')
-                    new_key_list[2] = str(global_layer_no)
-                    new_key = '.'.join(new_key_list)
-
-                tp_data = [model_state_dict_lst[pp_rank][tp_rank][vpp_rank][key] for tp_rank in range(tp_size)]
-                merged = merge_across_tp(new_key, tp_data)
-                if not isinstance(merged,list):
-                    state_dict[new_key] = merged
-                elif len(merged)==3:
-                    # split qkv
-                    for n,d in zip(['q','k','v'], merged):
-                        state_dict[new_key.replace("linear_qkv",f"linear_{n}")] = d
-                elif len(merged)==2:
-                    # split gate up
-                    state_dict[new_key.replace("linear_fc1","gate_proj")] = merged[0]
-                    state_dict[new_key.replace("linear_fc1","up_proj")] = merged[1]
-            layers_cum += layers_handled+1 # zero based
-
-    del model_state_dict_lst
-
-    params_mapping = [
-        # (megatron core gpt model name, vllm model name)
-        ("self_attention.linear_qkv.layer_norm_weight", "input_layernorm.weight"),
-        ("self_attention.linear_qkv.layer_norm_bias", "input_layernorm.bias"),
-        ("embedding.word_embeddings", "model.embed_tokens"),
-        ("self_attention.linear_qkv", "self_attn.qkv_proj"),
-        ("self_attention.linear_proj", "self_attn.o_proj"),
-        ("pre_mlp_layernorm", "post_attention_layernorm"),
-        ("mlp.linear_fc1.layer_norm_weight", "post_attention_layernorm.weight"),
-        ("mlp.linear_fc1.layer_norm_bias", "post_attention_layernorm.bias"),
-        ("mlp.linear_fc1", "mlp.gate_up_proj"),
-        ("mlp.linear_fc2", "mlp.down_proj"),
-        ("decoder.final_layernorm", "model.norm"),
-        ("output_layer", "lm_head"),
-        ("self_attention.linear_q", "self_attn.q_proj"),
-        ("self_attention.linear_k", "self_attn.k_proj"),
-        ("self_attention.linear_v", "self_attn.v_proj"),
-    ]
+                if re.search(r"self_attn\.qkv_proj", key) is None and re.search(r"gate_up_proj", key) is None:
+                    state_dict[key] = None
+                for tp_rank in range(tp_size):
+                    model_state_dict = model_state_dict_lst[pp_rank][tp_rank][vpp_rank]
+                    state_dict = merge_between_tp_rank(key, model_state_dict)

+    del model_state_dict_lst
     if args.test:
-
-        for original_name, loaded_weight in state_dict.items():
-            name = _replace_name(original_name, params_mapping)
-            if not name or name.endswith(".bias") and name not in ref_state_dict:
-                continue
-            if "rotary_emb.inv_freq" in name:
-                continue
-            if args.tie_word_embedding and "lm_head.weight" in name:
-                continue
-            if name not in ref_state_dict:
-                raise RuntimeError(f'key: {name} not exist in state_dict')
-            param = ref_state_dict[name]
-            assert loaded_weight.dtype == param.dtype
-            torch.testing.assert_close(loaded_weight, param, atol=1e-4, rtol=1e-4)
+        for key, value in state_dict.items():
+            print(key)
+            if key not in ref_state_dict:
+                raise RuntimeError(f'key: {key} not exist in ref_state_dict {value}')
+            if value.shape != ref_state_dict[key].shape:
+                raise RuntimeError(f'key: {key} shape mismatch {value.shape}, {ref_state_dict[key].shape}')
+            assert value.dtype == ref_state_dict[key].dtype, f'{key} state_dict[key].dtype: {value.dtype} != ref_state_dict[key].dtype: {ref_state_dict[key].dtype}'
+            torch.testing.assert_close(value, ref_state_dict[key], atol=1e-4, rtol=1e-4)
+        for key in ref_state_dict:
+            if key not in state_dict:
+                raise RuntimeError(f'key: {key} not exist in state_dict {ref_state_dict[key]}')
+

     print('Writing to local disk')
     if args.target_dir is None:
@@ -388,29 +415,6 @@ def merge_across_tp(key, tp_data):
     if args.hf_upload_path:
         upload_model_to_huggingface(hf_path)

-
-def _replace_name(megatron_name, name_mapping):
-    for m_name, v_name in name_mapping:
-        if m_name not in megatron_name:
-            continue
-        if "layers" in megatron_name: # deal with decoder layers
-            megatron_name = megatron_name.replace("decoder", "model")
-            megatron_name_list = megatron_name.split(".")
-            if "layer_norm_weight" in megatron_name_list or "layer_norm_bias" in megatron_name_list:
-                param_name_list = megatron_name_list[:3]
-                param_name_list.append(v_name)
-                param_name = ".".join(param_name_list)
-            else:
-                param_name_list = megatron_name_list[:3]
-                weight_or_bias = megatron_name_list[-1]
-                param_name_list.append(v_name)
-                param_name_list.append(weight_or_bias)
-                param_name = ".".join(param_name_list)
-            return param_name
-        else:
-            param_name = megatron_name.replace(m_name, v_name)
-            return param_name
-
 if __name__ == '__main__':
     if args.backend == "fsdp":
         convert_fsdp_checkpoints_to_hfmodels()
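For context on what the restored merger does, here is a standalone sketch (not part of the diff) of the per-rank slicing in `handle_qkv_proj` for the `num_key_value_heads >= tp_size` case: each tensor-parallel rank holds a fused `[q | k | v]` slice of `qkv_proj`, which is split and concatenated across ranks into separate `q_proj`/`k_proj`/`v_proj` tensors. All shapes below are made up for illustration.

```python
import torch

# Toy GQA configuration (illustrative values, not from the commit)
hidden_size = 2048
num_attention_heads = 16
num_key_value_heads = 4
tp_size = 2

head_dim = hidden_size // num_attention_heads            # 128
q_size_tp = hidden_size // tp_size                       # query rows per rank
kv_size_tp = head_dim * num_key_value_heads // tp_size   # key/value rows per rank

q_parts, k_parts, v_parts = [], [], []
for rank in range(tp_size):
    # One rank's fused qkv_proj weight shard: (q_size_tp + 2 * kv_size_tp, hidden_size)
    shard = torch.randn(q_size_tp + 2 * kv_size_tp, hidden_size)
    q_parts.append(shard[:q_size_tp])
    k_parts.append(shard[q_size_tp:q_size_tp + kv_size_tp])
    v_parts.append(shard[q_size_tp + kv_size_tp:])

# Concatenating the per-rank slices recovers the full projection weights
q_proj = torch.cat(q_parts, dim=0)
k_proj = torch.cat(k_parts, dim=0)
v_proj = torch.cat(v_parts, dim=0)

assert q_proj.shape == (hidden_size, hidden_size)
assert k_proj.shape == (head_dim * num_key_value_heads, hidden_size)
assert v_proj.shape == k_proj.shape
```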

verl/models/llama/megatron/layers/parallel_linear.py

Lines changed: 0 additions & 31 deletions
@@ -72,34 +72,3 @@ def __init__(self,
                          gather_output=gather_output,
                          skip_bias_add=skip_bias_add,
                          **kwargs)
-
-
-import torch
-
-
-class LinearForLastLayer(torch.nn.Linear):
-
-    def __init__(
-        self,
-        input_size,
-        output_size,
-        *,
-        config,
-        bias=True,
-    ):
-        super().__init__(in_features=input_size, out_features=output_size, bias=bias)
-        self.sequence_parallel = config.sequence_parallel
-        if self.sequence_parallel:
-            setattr(self.weight, 'sequence_parallel', True)
-
-    def forward(
-        self,
-        input_,
-        weight=None,
-        runtime_gather_output=None,
-    ):
-        logits = super().forward(input_)
-        logits = logits.float()
-        if self.sequence_parallel:
-            logits = tensor_parallel.gather_from_sequence_parallel_region(logits, tensor_parallel_output_grad=False)
-        return logits, None
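The deleted `LinearForLastLayer` is essentially a plain `torch.nn.Linear` used as the output projection: it upcasts logits to fp32 and returns a `(logits, bias)` pair in Megatron's output-layer convention, and, when sequence parallelism is enabled, gathers logits via `tensor_parallel.gather_from_sequence_parallel_region`. A stripped-down sketch without the Megatron dependency (toy sizes, not from the commit):

```python
import torch


class SimpleLastLayer(torch.nn.Linear):
    """Output projection mimicking the deleted class, minus sequence parallelism."""

    def forward(self, input_):
        logits = super().forward(input_)
        logits = logits.float()  # upcast so the loss is computed in fp32
        return logits, None      # (logits, bias) tuple, Megatron output-layer style


# Usage with toy sizes
layer = SimpleLastLayer(in_features=512, out_features=1000, bias=False)
hidden = torch.randn(4, 512)
logits, _ = layer(hidden)
assert logits.dtype == torch.float32
```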

verl/models/mcore/__init__.py

Lines changed: 0 additions & 16 deletions
This file was deleted.
