How should label_ids be set for Text Packing data in the pretraining stage? #29
zykRichard
announced in
Announcements
Replies: 1 comment
No need to handle it that way.
When pretraining data is processed with Text Packing, it seems label_ids gets no special handling at cross-document boundaries?
For example, if a packed batch is [A0, A1, A2, </s>, B0, B1, </s>, C1], then:
input_ids : [A0, A1, A2, </s>, B0, B1, </s>, C1]
label_ids: [A1, A2, </s>, -100, B1, </s>, -100]
(different letters denote tokens from different documents)
That is, shouldn't the label for the first token of each following document be set to the ignore index, since predicting it across a document boundary is meaningless?
I could not find this logic in the source files (neither pretrain_collate_fn in utils.py nor the LMLoss forward in llm_trainer's loss.py has anything corresponding).
Is this handled somewhere else, or is it simply unnecessary? If it's unnecessary, I'd appreciate an explanation of why. Thanks!
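For reference, the boundary masking described above could be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `mask_cross_document_labels` and the assumption that documents are separated by a single EOS token id are mine; the `-100` ignore index matches PyTorch's `CrossEntropyLoss` default.

```python
import torch

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_cross_document_labels(input_ids: torch.Tensor, eos_token_id: int) -> torch.Tensor:
    """Build next-token labels for a packed 1-D sequence.

    labels[i] = input_ids[i + 1], except where input_ids[i] is the EOS
    separator: there the next token is the first token of a *different*
    document, so the target is replaced with IGNORE_INDEX.
    """
    labels = input_ids[1:].clone()             # shift left by one position
    boundary = input_ids[:-1] == eos_token_id  # positions predicting across a doc boundary
    labels[boundary] = IGNORE_INDEX
    return labels
```

With `eos_token_id = 0` and a packed sequence `[A0, A1, A2, </s>, B0, B1, </s>, C1]` encoded as `[1, 2, 3, 0, 4, 5, 0, 6]`, this yields `[2, 3, 0, -100, 5, 0, -100]`, matching the label_ids layout in the question.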