Problem updating model parameters during sampling #41

@Tbuterin

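I hit an error when vLLM loads the updated model parameters: "start (0) + length (1536) exceeds dimension size (0)."
In the source code, the model parameters are obtained with "state_dict = engine.module.state_dict()".
gen_worker calls the try_update_model function to update the parameters. The original function is as follows: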

def try_update_model():
    try:
        new_state_dict = Q.get_nowait()
        print('[VLLM PROC] recving new model ...')
        llm_model = vllm_gen.llm_engine.model_executor.driver_worker.model_runner.model
        llm_model.load_weights(new_state_dict.items())
        print('[VLLM PROC] model updated')
        del new_state_dict
    except:
        # Bare except: catches every exception, not only an empty queue.
        #print('[VLLM PROC] no new model')
        return

I modified the function as follows:

from queue import Empty  # raised by Q.get_nowait() when the queue is empty

def try_update_model():
    try:
        new_state_dict = Q.get_nowait()
        print('[VLLM PROC] recving new model ...')
        llm_model = vllm_gen.llm_engine.model_executor.driver_worker.model_runner.model
        llm_model.load_weights(new_state_dict.items())
        print('[VLLM PROC] model updated')
        del new_state_dict
    except Empty:
        # No new weights have been queued yet (printed in yellow).
        print("\033[33mempty\033[0m")
        return
    except Exception as e:
        # Report real failures (printed in red) instead of hiding them.
        print(f"\033[31m{e}\033[0m")
        return

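After this change, the error "start (0) + length (1536) exceeds dimension size (0)." is printed to the screen. I suspect the original code hits the same error, but because its except branch simply returns, nothing is reported and the sampling model is never actually updated. Is that the case?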
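To illustrate the concern, here is a minimal sketch of how a bare except can make a failed load look the same as "no new model". Nothing in it comes from the repository: load_weights_stub, update_with_bare_except, and update_with_split_except are hypothetical names, and the stub simply raises the error message reported above.

```python
from queue import Queue, Empty


def load_weights_stub(items):
    # Hypothetical stand-in for llm_model.load_weights; it always fails
    # with the message reported in this issue.
    raise RuntimeError("start (0) + length (1536) exceeds dimension size (0).")


def update_with_bare_except(q):
    # Mirrors the original handler: every exception ends in a silent return.
    try:
        state = q.get_nowait()
        load_weights_stub(state.items())
        return "updated"
    except:
        return "skipped"


def update_with_split_except(q):
    # Separates "queue is empty" from "loading failed".
    try:
        state = q.get_nowait()
    except Empty:
        return "empty"
    try:
        load_weights_stub(state.items())
        return "updated"
    except Exception as e:
        return f"failed: {e}"


q = Queue()
print(update_with_bare_except(q))   # queue is empty
q.put({"w": [1.0]})
print(update_with_bare_except(q))   # load failed, but the result is identical
q.put({"w": [1.0]})
print(update_with_split_except(q))  # the failure is now visible
```

With the bare except, the first two calls return the same value, so a caller cannot tell an empty queue from a failed weight load.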

My library versions are:

  • torch 2.4.0
  • vllm 0.6.3.post1
  • deepspeed 0.16.7
  • transformers 4.51.3 (also tried 4.47.1)
  • xformers 0.0.27.post2

My CUDA version is 12.2, and the model being trained is Qwen2.5-1.5B.
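The "dimension size (0)" part of the message suggests that at least one tensor reaching load_weights has no elements. One possible cause, which is only an assumption and not confirmed here, is that the DeepSpeed configuration partitions parameters (for example ZeRO stage 3), so that engine.module.state_dict() returns empty placeholders. A rough check could look like the sketch below. find_empty_params and FakeTensor are hypothetical names; FakeTensor only imitates the numel() method of torch.Tensor so that the sketch runs without torch.

```python
class FakeTensor:
    # Hypothetical stand-in that exposes numel() like torch.Tensor.
    def __init__(self, *shape):
        self.shape = shape

    def numel(self):
        n = 1
        for d in self.shape:
            n *= d
        return n


def find_empty_params(state_dict):
    # Return the names of entries whose tensor holds zero elements.
    return [name for name, t in state_dict.items() if t.numel() == 0]


# Illustrative data only; these names are not taken from a real checkpoint.
sd = {"embed.weight": FakeTensor(0), "norm.weight": FakeTensor(1536)}
print(find_empty_params(sd))  # ['embed.weight']
```

Running the same function on the real state_dict before it is put on the queue would show whether empty tensors are being sent to the vLLM process.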

Has the author run into this problem? How did you solve it? Looking forward to your reply, thanks!
