Update template_meta.prefix bug #4813
base: main
Conversation
Can you modify only the …
What you suggested in #4806 was to filter out the elements that do not meet the condition, but that is not a reasonable approach. Adding tokens such as bos is a normal requirement, and the list-typed elements returned by _swift_encode are exactly these bos tokens. The root cause is that template_meta.prefix has already been converted to token ids (e.g. 32013) ahead of time, whereas here it should still be the token (e.g. '<|begin▁of▁sentence|>'). If the token-id elements are simply filtered out, the bos token that needs to be added gets dropped as well, which does not meet the requirement. If instead the list-typed elements of res_context_list (which are token-id lists) are concatenated into the prompt by forcibly converting the ints to str, then the later encode step will encode those token ids again and produce inconsistent results. In short, neither filtering out non-str elements in _apply_chat_template_to_messages_list nor forcibly converting the list elements to str is reasonable.
```python
res_context_list.append(context)
res_context_type.append(ContextType.OTHER)
if isinstance(context, list) and isinstance(context[0], int):
    context = tokenizer.decode(context)
```
Decode is not valid here.
The special tokens of a few models have no valid string outputs, or the outputs may be empty
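As a hedged sketch of how to check this for a given tokenizer (the model name is only an example already discussed in this PR, not one the reviewer named):

```python
from transformers import AutoTokenizer

# Example model only; any HF tokenizer can be checked the same way.
tok = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")

# Special token ids whose decode() yields an empty string, i.e. the case
# where decoding back to text loses the token entirely.
empty_decode_ids = [tid for tid in tok.all_special_ids if tok.decode([tid]) == ""]
print(empty_decode_ids)
```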
The res_context_list is a mix of strings and lists of integers
“The res_context_list is a mix of strings and lists of integers” raises two problems (see the sketch after this list):

- If an element of res_context_list is a list of integers, the code `prompts_text.append(''.join(res_context_list))` raises an exception.
- If such an element is instead forced to str so that `''.join(res_context_list)` can merge it into the prompt (which is of type str), the later encode step either raises an exception or produces an incorrect result, because the integers are already token ids, not text.
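A minimal sketch of these two failure modes (the prompt text is illustrative; 32013 is the bos id discussed in this PR):

```python
res_context_list = [[32013], "User: xx\nAssistant: xx"]  # token-id list mixed with strings

# Failure 1: ''.join() on a mixed list raises immediately.
try:
    "".join(res_context_list)
except TypeError as exc:
    print(exc)  # sequence item 0: expected str instance, list found

# Failure 2: forcing the non-str element to str keeps join() alive, but the
# prompt now contains the literal text '[32013]'; a later encode() tokenizes
# those digits as ordinary text instead of reproducing the bos token id.
prompt = "".join(c if isinstance(c, str) else str(c) for c in res_context_list)
print(prompt)  # [32013]User: xx\nAssistant: xx
```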
> Decode is not valid here. The special tokens of a few models have no valid string outputs, or the outputs may be empty.
If some models have no valid string output for these special tokens, that just means the prompt does not need the special token, which is also fine.
Maybe we can modify the code to:
```python
if isinstance(context, list) and isinstance(context[0], int):
    context = tokenizer.decode(context)
if context:
    res_context_list.append(context)
    res_context_type.append(ContextType.OTHER)
```
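A hedged sketch of how that guard behaves when decode() returns an empty string (DummyTokenizer and append_context are stand-ins written for this example, not project code):

```python
class DummyTokenizer:
    def decode(self, ids):
        return ""  # simulates a special token with no valid string output

def append_context(context, res_context_list, res_context_type, tokenizer):
    # Same shape as the snippet above, with ContextType.OTHER replaced by a plain string.
    if isinstance(context, list) and isinstance(context[0], int):
        context = tokenizer.decode(context)
    if context:
        res_context_list.append(context)
        res_context_type.append("other")

contexts, types = [], []
append_context([32013], contexts, types, DummyTokenizer())
print(contexts, types)  # [] [] -- the empty decode is dropped, so ''.join() stays safe downstream
```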
PR type
Running GRPO with the deepseek-ai/deepseek-coder-6.7b-base model hits a template_meta.prefix error; see the linked issue.
The symptom is the same as described in that issue.
The buggy code is in the _swift_encode function of swift/llm/template/base.py; the lines involved are the ones touched by the diff above:

```python
res_context_list.append(context)
res_context_type.append(ContextType.OTHER)
```
PR information
Debugging shows that prefix is [[32013]], whose corresponding token is '<|begin▁of▁sentence|>'. The cause lies in the init() function of swift/llm/template/template_meta.py, which converts values such as prefix into token ids and overwrites the attribute with them.
To assemble a complete and correct prompt, token_id=32013 needs to be decoded back to '<|begin▁of▁sentence|>', so that the resulting prompt is '<|begin▁of▁sentence|>User:xx\nAssistant:xx'.
The code is from the latest main branch: swift.version: 3.6.0.dev0.
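A minimal reproduction sketch of the observation above, assuming the deepseek-ai/deepseek-coder-6.7b-base tokenizer is available:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-coder-6.7b-base")

prefix = [[32013]]  # shape of template_meta.prefix after init() has converted it to ids

print(tokenizer.decode(prefix[0]))  # '<|begin▁of▁sentence|>'

# Decoding the ids back before assembling the prompt gives the intended text.
prompt = tokenizer.decode(prefix[0]) + "User:xx\nAssistant:xx"
print(prompt)  # '<|begin▁of▁sentence|>User:xx\nAssistant:xx'
```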
Solution:
Convert prefix=[[32013]] back to its token and concatenate that token into the prompt. Note that this is only a simple fix.
A thorough, general solution would store the token form of prefix in swift/llm/template/template_meta.py (and, if necessary, the token_id form as well), and handle suffix and similar fields in the same way. To avoid unpredictable problems from modifying that file directly, this PR takes the minimal-change approach and fixes the issue first.
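A hedged sketch of what that more general direction could look like (TemplateMetaSketch and its field names are hypothetical; the real TemplateMeta in swift/llm/template/template_meta.py is not changed this way by this PR):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TemplateMetaSketch:
    # Keep the human-readable token form...
    prefix: List[str] = field(default_factory=lambda: ["<|begin▁of▁sentence|>"])
    # ...and cache the id form separately instead of overwriting prefix with ids.
    prefix_token_ids: Optional[List[List[int]]] = None

    def init_token_ids(self, tokenizer):
        # e.g. [[32013]] for the deepseek-coder bos token
        self.prefix_token_ids = [
            tokenizer.encode(p, add_special_tokens=False) for p in self.prefix
        ]
```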