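I ran into a problem when vLLM loads the updated model parameters: "start (0) + length (1536) exceeds dimension size (0)."
In the source code, the model parameters are obtained via `state_dict = engine.module.state_dict()`,
and gen_worker calls the `try_update_model` function to update the parameters. The original function is: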
```python
def try_update_model():
    try:
        new_state_dict = Q.get_nowait()
        print('[VLLM PROC] recving new model ...')
        llm_model = vllm_gen.llm_engine.model_executor.driver_worker.model_runner.model
        llm_model.load_weights(new_state_dict.items())
        print('[VLLM PROC] model updated')
        del new_state_dict
    except:
        #print('[VLLM PROC] no new model')
        return
```

I modified the function as follows:
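```python
from queue import Empty  # Q.get_nowait() raises queue.Empty when nothing is queued

def try_update_model():
    try:
        new_state_dict = Q.get_nowait()
        print('[VLLM PROC] recving new model ...')
        llm_model = vllm_gen.llm_engine.model_executor.driver_worker.model_runner.model
        llm_model.load_weights(new_state_dict.items())
        print('[VLLM PROC] model updated')
        del new_state_dict
    except Empty:
        # No new weights queued yet: print a yellow notice
        print(f"\033[33mempty\033[0m")
        return
    except Exception as e:
        # Any other failure (e.g. inside load_weights): print it in red
        print(f"\033[31m{e}\033[0m")
        return
```

After this change, the message `start (0) + length (1536) exceeds dimension size (0).` is printed. I wonder whether the original code hits the same error too, except that its bare `except` just returns silently, so nothing visibly goes wrong while the sampling model is never actually updated?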
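The exception-handling part of this suspicion can be checked without vLLM. Below is a minimal, stdlib-only sketch under stated assumptions: `fake_load_weights` is a hypothetical stand-in that raises the same message instead of calling vLLM's real `load_weights`, and a plain `queue.Queue` replaces `Q`. It only demonstrates how the two handler styles behave; it does not reproduce the underlying loading error.

```python
import queue

# Hypothetical stand-in for llm_model.load_weights: it fails with the
# same message seen in the report instead of loading anything.
def fake_load_weights(items):
    raise RuntimeError("start (0) + length (1536) exceeds dimension size (0).")

def try_update_original(q, log):
    # Mirrors the original pattern: the bare except hides every failure.
    try:
        new_state_dict = q.get_nowait()
        fake_load_weights(new_state_dict.items())
        log.append("model updated")
    except:
        return

def try_update_modified(q, log):
    # Mirrors the modified pattern: an empty queue is expected,
    # any other exception is reported.
    try:
        new_state_dict = q.get_nowait()
        fake_load_weights(new_state_dict.items())
        log.append("model updated")
    except queue.Empty:
        log.append("empty")
    except Exception as e:
        log.append(f"error: {e}")

q1 = queue.Queue()
q1.put({"w": [0.0]})
log1 = []
try_update_original(q1, log1)
print(log1)  # the update failed, yet nothing was recorded

q2 = queue.Queue()
q2.put({"w": [0.0]})
log2 = []
try_update_modified(q2, log2)
print(log2)  # the same failure is now visible

log3 = []
try_update_modified(queue.Queue(), log3)
print(log3)  # an empty queue is reported separately
```

With the original handler, `log1` stays empty and "model updated" never appears, which matches the scenario where the sampling model silently keeps its old weights.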
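My library versions:
- torch 2.4.0
- vllm 0.6.3.post1
- deepspeed 0.16.7
- transformers 4.51.3 (also tried 4.47.1)
- xformers 0.0.27.post2
My CUDA version is 12.2, and the model being trained is Qwen2.5-1.5B.
Have you run into this problem? If so, how did you solve it? Looking forward to your reply, thank you very much!!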