Fine-tuning on a custom dataset raises an encoding error: UnicodeEncodeError: 'gbk' codec can't encode character '\uffd' in position 18 #813

Dorry-wangxixi · 2024-02-02T03:28:35Z

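CUDA 11.8; Python 3.10; Windows server with an A40 GPU
1. Created my own dataset in Notepad in the ChatGLM2 format, with content and summary labels.
2. Changed the dataset path in format_advertise_gen.py and used it to generate the ChatGLM3 format, with prompt and response labels.
3. Ran the finetune_pt.sh script to train on my own dataset.
4. Got the error UnicodeEncodeError: 'gbk' codec can't encode character '\uffd' in position 18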
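
For readers following along, here is a minimal sketch of how each of the error types reported in this thread can be triggered in isolation. It is not a diagnosis of the poster's setup. It assumes only standard Python behaviour: GBK has no mapping for U+FFFD (the Unicode replacement character), the byte 0xBD cannot start a UTF-8 sequence, and `json.loads` rejects input that contains only a newline. The helper name `describe_failure` is made up for illustration. On Windows, `open()` without an explicit `encoding` and console output often default to the locale encoding (commonly GBK), which is one plausible place for the first error to surface.

```python
import json

def describe_failure(action):
    """Run action() and return the name of the exception it raises, or None."""
    try:
        action()
    except (UnicodeError, json.JSONDecodeError) as exc:
        return type(exc).__name__
    return None

# U+FFFD is the Unicode replacement character; the GBK codec cannot encode it.
encode_error = describe_failure(lambda: "\ufffd".encode("gbk"))

# 0xBD is a UTF-8 continuation byte, so it is invalid as the first byte of a character.
decode_error = describe_failure(lambda: b"\xbd".decode("utf-8"))

# Input that is just a newline holds no JSON value at all.
json_error = describe_failure(lambda: json.loads("\n"))

print(encode_error, decode_error, json_error)
```

If U+FFFD is present in the data, it may mean the text was already damaged by an earlier decode with replacement, before the file was saved as UTF-8.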

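1. Checked my dataset's .jsonl file with chardet; it is UTF-8 encoded.
2. The PyCharm interpreter is also set to UTF-8.
3. I also added utf-8 when reading the dataset in finetune.py.
4. Besides the error above, while trying to fix the bug I occasionally get UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbd in position 13: invalid start byte
json.decoder.JSONDecodeError: Expecting value: line 2 column 1 (char 1)
5. What should I watch out for when building my own dataset, and what are the likely causes of these three encoding/decoding errors?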
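
One way to narrow this down is to check the .jsonl file line by line at the byte level, since a whole-file guess from chardet can miss a single bad line. The sketch below is only an assumption about what is worth checking: `check_jsonl` is a hypothetical helper, and the `prompt`/`response` keys are taken from the format described above.

```python
import json

def check_jsonl(path, required_keys=("prompt", "response")):
    """Return a list of (line_number, problem) tuples for a JSONL file."""
    problems = []
    # Read bytes so that a decoding failure can be pinned to a specific line.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            if lineno == 1 and raw.startswith(b"\xef\xbb\xbf"):
                problems.append((lineno, "UTF-8 BOM at start of file"))
                raw = raw[3:]
            try:
                text = raw.decode("utf-8")
            except UnicodeDecodeError as exc:
                problems.append((lineno, f"not valid UTF-8: {exc.reason}"))
                continue
            if not text.strip():
                problems.append((lineno, "blank line"))
                continue
            if "\ufffd" in text:
                problems.append((lineno, "contains U+FFFD replacement character"))
            try:
                record = json.loads(text)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            missing = [key for key in required_keys if key not in record]
            if missing:
                problems.append((lineno, f"missing keys: {missing}"))
    return problems
```

Calling `check_jsonl("train.jsonl")` (the file name is a placeholder) returns an empty list when every line decodes as UTF-8, parses as a JSON object, and carries both keys.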

zRzRzRzRzRzRzR (Maintainer) · 2024-02-02T06:44:59Z

Some Chinese characters and symbols simply fall outside the UTF-8 character set, so they will never work.

Roych13 · 2024-02-04T13:02:46Z

That doesn't seem to be related. When the file contains Chinese characters, add encoding="GBK" when saving the .jsonl file.
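
Whichever encoding is chosen, the writer and the reader have to agree on it. The sketch below only illustrates that point under stated assumptions: `roundtrip` is a hypothetical helper and the sample record is made up. It does not establish which encoding the fine-tuning script expects. The original post says the reader in finetune.py uses utf-8, in which case a file saved as GBK would need the reader changed to match.

```python
import os
import tempfile

def roundtrip(text, write_encoding, read_encoding):
    """Write text with one encoding, read it back with another.

    Returns the text that was read, or the exception name if decoding fails.
    """
    fd, path = tempfile.mkstemp(suffix=".jsonl")
    os.close(fd)
    try:
        with open(path, "w", encoding=write_encoding) as f:
            f.write(text)
        try:
            with open(path, "r", encoding=read_encoding) as f:
                return f.read()
        except UnicodeDecodeError as exc:
            return type(exc).__name__
    finally:
        os.remove(path)

sample = '{"prompt": "中文"}'
print(roundtrip(sample, "utf-8", "utf-8"))  # matching encodings: text comes back intact
print(roundtrip(sample, "gbk", "gbk"))      # matching encodings: text comes back intact
print(roundtrip(sample, "gbk", "utf-8"))    # mismatch: decoding fails
```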

KevinFanng · 2024-02-06T12:38:14Z

I wrote a Python program that reads data from Excel and writes the train.json file with encoding='utf-8':
import pandas as pd
import json

def process_excel(file_path, output_json_file):
    # Read the Excel file
    xls = pd.ExcelFile(file_path)

    # List that collects every record
    all_data = []

    ……

            # Build the content and summary strings for this record
            content = f"字段1#{cell_value1}*字段2#{cell_value2}*字段3#{cell_value3}"
            summary = f"字段4#{cell_value4}*字段5#{cell_value5}"
            data = {"content": content, "summary": summary}

            # Append to the list; ensure_ascii=False keeps Chinese text readable
            all_data.append(json.dumps(data, ensure_ascii=False))

    # Write the list to the JSON file, one JSON object per line
    with open(output_json_file, 'w', encoding='utf-8') as json_file:
        json_file.write("\n".join(all_data))

excel_file_path = '测试.xlsx'
output_json_file_path = 'train_chatgpt.json'  # Path of the generated JSON file

process_excel(excel_file_path, output_json_file_path)
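
As a companion to the script above, here is a pandas-free sketch of the write and read halves with the encoding stated explicitly on both sides. The names `convert_record`, `write_jsonl`, and `read_jsonl` are hypothetical, and the content-to-prompt and summary-to-response mapping is an assumption based on the labels mentioned in the original post, not the actual behaviour of format_advertise_gen.py.

```python
import json

def convert_record(record):
    """Map a {"content", "summary"} record to {"prompt", "response"} (assumed mapping)."""
    return {"prompt": record["content"], "response": record["summary"]}

def write_jsonl(records, path):
    """Write one JSON object per line as UTF-8."""
    # newline="\n" keeps line endings as \n on Windows instead of \r\n.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for record in records:
            # ensure_ascii=False stores Chinese text as-is instead of \uXXXX escapes.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

def read_jsonl(path):
    """Read a UTF-8 JSONL file, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Skipping blank lines on read is a deliberate choice here: a stray empty line is a common way to hit a JSONDecodeError with the message "Expecting value".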
