Fine-tuning on a custom dataset fails with an encoding error: UnicodeEncodeError: 'gbk' codec can't encode character '\uffd' in position 18 #813
Replies: 3 comments
-
Some Chinese characters and symbols fall outside the UTF-8 character set, so this will always fail.
-
That doesn't seem to be related. When the file contains Chinese characters, add encoding="GBK" when saving the jsonl file.
-
I wrote a Python program that reads data from Excel and generates the train.json file with encoding='utf-8'. Signature: def process_excel(file_path, output_json_file):
Call: excel_file_path = '测试.xlsx'; process_excel(excel_file_path, output_json_file_path)
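A minimal sketch of the writing step only, assuming the rows have already been read from Excel into a list of dicts. `write_jsonl`, `read_jsonl`, and the sample record are hypothetical names for illustration, not the code from this comment. Passing `encoding` explicitly to `open()` avoids relying on the Windows locale default, and `ensure_ascii=False` keeps Chinese text readable in the output file.

```python
import json
import os
import tempfile


def write_jsonl(records, path):
    """Write one JSON object per line, always as UTF-8 (hypothetical helper)."""
    # newline="\n" stops Windows from translating line endings to \r\n.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read the file back with the same explicit encoding, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Illustrative data only; the prompt/response keys follow the ChatGLM3
# format named in the original post.
sample = [{"prompt": "类型#裙", "response": "这是一条测试数据"}]
out_path = os.path.join(tempfile.mkdtemp(), "train.json")
write_jsonl(sample, out_path)
roundtrip = read_jsonl(out_path)
```

The same `encoding="utf-8"` argument has to be used by whatever script later reads the file; a mismatch between writer and reader is a common source of decode errors.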
-
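CUDA 11.8; Python 3.10; Windows server with an A40
1. Created my own dataset in Notepad in the ChatGLM2 format, with content and summary labels.
2. Ran format_advertise_gen.py with the dataset path changed to generate the ChatGLM3 format, with prompt and response labels.
3. Ran the finetune_pt.sh script to train on my own dataset.
4. Got the error UnicodeEncodeError: 'gbk' codec can't encode character '\uffd' in position 18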
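This kind of error can be reproduced in isolation. The sketch below assumes, purely for illustration, a character such as U+FFFD (the Unicode replacement character) that GBK has no mapping for; which character actually triggered the error in this run is not confirmed by the post. It also prints the encoding Python falls back to when `open()` is called without an `encoding` argument, which on a Chinese-locale Windows machine is typically cp936 (GBK).

```python
import locale

# Assumption for illustration: U+FFFD, which GBK cannot represent.
text = "bad char: \ufffd"

try:
    text.encode("gbk")
    gbk_failed = False
except UnicodeEncodeError as exc:
    gbk_failed = True
    print("GBK cannot encode it:", exc)

# UTF-8 can encode the same string without error.
utf8_bytes = text.encode("utf-8")

# This is the encoding open() uses when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("default text-file encoding:", default_encoding)
```

If the default turns out to be GBK, any `open(..., "w")` or `print()` in the pipeline that does not set an encoding is a candidate for the failure. Besides adding `encoding="utf-8"` to each `open()` call, setting the environment variable `PYTHONUTF8=1` (Python 3.7+) before launching the script may be worth trying.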
1. Checked my dataset's .jsonl file with chardet; it is UTF-8 encoded.
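2. The PyCharm interpreter is also set to UTF-8.
3. I also added utf-8 when reading the dataset in finetune.py.
4. Besides the error above, while trying to fix the bug I occasionally get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 13: invalid start byte
and json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
5. What should I watch out for when building my own dataset, and what are the possible causes of these three encoding/decoding errors?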
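One way to narrow down all three errors is to validate the dataset file line by line before training. The following is a minimal sketch, not part of the ChatGLM3 repository; `check_jsonl` and the demo file are hypothetical. It reads raw bytes so that a single badly encoded line is reported with its line number instead of aborting the whole read.

```python
import json
import os
import tempfile


def check_jsonl(path):
    """Return a list of (line_number, problem) tuples for a JSONL file."""
    problems = []
    with open(path, "rb") as f:  # bytes, so decoding errors are caught per line
        for number, raw in enumerate(f, start=1):
            if number == 1 and raw.startswith(b"\xef\xbb\xbf"):
                problems.append((number, "UTF-8 BOM at start of file"))
                raw = raw[3:]
            try:
                line = raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append((number, f"not valid UTF-8: {exc.reason}"))
                continue
            if not line.strip():
                problems.append((number, "blank line"))
                continue
            if "\ufffd" in line:
                problems.append((number, "contains U+FFFD replacement character"))
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((number, f"invalid JSON: {exc.msg}"))
    return problems


# Demo file with one good line and three deliberately broken ones.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "wb") as f:
    f.write('{"prompt": "你好", "response": "ok"}\n'.encode("utf-8"))
    f.write(b"\n")                                   # blank line
    f.write('{"prompt": "你好"}\n'.encode("gbk"))     # GBK bytes in a UTF-8 file
    f.write(b'{"prompt": "x",}\n')                   # trailing comma
problems = check_jsonl(demo_path)
for number, problem in problems:
    print(f"line {number}: {problem}")
```

Two observations that may help with interpretation: calling `json.loads("\n")` raises exactly `Expecting value: line 2 column 1 (char 1)`, so a blank line in the file is one possible cause of that message; and `invalid start byte` usually means the bytes being read are not UTF-8 at that position, for example text saved by an editor in a different encoding.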