"list index out of range" error during text recognition training #12912

phb-shiyige-fw · 2023-12-09T12:50:37Z


Please provide the following information to quickly locate the problem

System Environment: win10
Version: Paddle: 2.3.2, PaddleOCR: 2.6. Related components: rec-PPOCRv3
Command Code: train.py
Complete Error Message:
[2023/12/09 20:45:37] ppocr ERROR: When parsing line crop_img/table246_crop_11.jpg 6.车辆识别代号/车架号LHGCV1676P9038532
, error happened with msg: Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 158, in __getitem__
data['ext_data'] = self.get_ext_data()
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 124, in get_ext_data
label = substr[1]
IndexError: list index out of range
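While fine-tuning the model, training keeps hitting the error shown above. It seems to point to a problem with the training data, but I can't tell where the problem is or how to fix it. Could anyone help work out what is causing it? Many thanks! @haobibo @WenmuZhou @ZeyuChen @bingooo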

Answered by honghanh2008


honghanh2008 · 2023-12-11T02:20:45Z


        data_line = data_line.decode('utf-8')
        substr = data_line.strip("\n").split(self.delimiter)
        file_name = substr[0]
        file_name = self._try_parse_filename_list(file_name)
        label = substr[1]

Check the result of substr = data_line.strip("\n").split(self.delimiter). Most likely the training dataset you prepared does not match the required format, so reading the label fails.
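One way to run that check over the whole label file, rather than one line at a time, is sketched below. It assumes the recognition label file holds one sample per line in the form image path, delimiter, label, and that the delimiter is a tab (as far as I recall, that is the default for SimpleDataSet unless the config overrides it). find_bad_lines and the file name train_list.txt are hypothetical, not part of PaddleOCR.

```python
from pathlib import Path


def find_bad_lines(label_file, delimiter="\t"):
    """Return (line_number, raw_line) pairs that split into fewer than two fields.

    Uses the same split as the snippet above: strip the trailing newline,
    then split on the delimiter. A line with fewer than two fields is the
    kind that makes substr[1] raise IndexError.
    """
    bad = []
    with open(label_file, "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            substr = line.strip("\n").split(delimiter)
            if len(substr) < 2:
                bad.append((line_no, line))
    return bad


if __name__ == "__main__":
    # Placeholder path (assumption); point it at your own label file.
    path = Path("train_list.txt")
    if path.exists():
        for line_no, line in find_bad_lines(path):
            # repr() makes empty lines and space-vs-tab differences visible.
            print(f"line {line_no}: {line!r}")
```

Printing with repr() is deliberate: an empty line and a line that uses a space instead of a tab look normal in most editors but are easy to tell apart in the repr output.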

0 replies

aaakouaaaa · 2023-12-18T09:40:56Z


Hi, did you manage to solve your problem?

0 replies

Will439 · 2024-04-18T10:19:19Z


I ran into the same problem. substr clearly has two elements, and the character is in the dictionary, but it still raises the error.
Code:
        except:
            print("data_line=", data_line)
            print("data_line.decode('utf-8')=", self.data_lines[file_idx].decode('utf-8'))
            print("substr = data_line.strip('/n').split(self.delimiter)=",
                  self.data_lines[file_idx].decode('utf-8').strip("\n").split(self.delimiter))
            self.logger.error(
                "When parsing line {}, error happened with msg: {}".format(
                    data_line, traceback.format_exc()))

Output and error:
data_line= ABC03052_12.jpg 出

data_line.decode('utf-8')= ABC03052_12.jpg 出

substr = data_line.strip('/n').split(self.delimiter)= ['ABC03052_12.jpg', '出']
[2024/04/18 18:14:02] ppocr ERROR: When parsing line ABC03052_12.jpg 出
, error happened with msg: Traceback (most recent call last):
File "/home/qx/HaihenOCR/ppocr/data/simple_dataset.py", line 252, in __getitem__
data['ext_data'] = self.get_ext_data()
File "/XXX/ppocr/data/simple_dataset.py", line 118, in get_ext_data
label = substr[1]
IndexError: list index out of range
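
A possible explanation, offered as an assumption rather than a confirmed diagnosis: the traceback is raised inside get_ext_data, while the prints above show the line being handled in __getitem__. If get_ext_data pulls additional samples from elsewhere in the label file (that is my recollection of the PaddleOCR source, but it is not shown in this thread), then the line in the log message need not be the one that failed to split. The sketch below is a hypothetical helper, split_label_line, not PaddleOCR code. It could stand in for the split-and-index step so that the exception names the line actually being parsed.

```python
def split_label_line(data_line, delimiter="\t"):
    """Split one raw label line into (file_name, label).

    Accepts bytes or str. Raises ValueError carrying repr() of the
    offending line instead of a bare IndexError.
    """
    if isinstance(data_line, bytes):
        data_line = data_line.decode("utf-8")
    substr = data_line.strip("\n").split(delimiter)
    if len(substr) < 2:
        raise ValueError(
            "line does not contain delimiter {!r}: {!r}".format(delimiter, data_line)
        )
    return substr[0], substr[1]


if __name__ == "__main__":
    # A well-formed line (the tab delimiter here is an assumption).
    print(split_label_line("ABC03052_12.jpg\t出\n".encode("utf-8")))
    try:
        split_label_line(b"\n")  # an empty line somewhere else in the file
    except ValueError as err:
        print("bad line:", err)
```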

0 replies

Feisty19 · 2024-06-14T02:05:31Z


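I hit the same problem and spent a long time tracking it down yesterday, so here is what fixed it for me. In my case, the .txt files produced by splitting the dataset had blank lines between the entries. In PPOCRLabel/gen_ocr_train_val_test.py, lines 45-56, I changed train_txt.write("{}\t{}\n".format(image_copy_path, image_label)) to train_txt.write("{}\t{}".format(image_copy_path, image_label)), that is, I removed the \n line break. I then re-split the dataset, and training no longer raised the error.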
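
Why removing the \n helps can be illustrated with a small sketch. It assumes, which the thread does not state explicitly, that image_label comes from a line read without stripping and so already ends in a newline. Appending a second newline then leaves an empty line after every entry, and an empty line splits into a single-element list. format_entry is a hypothetical helper, not the code in gen_ocr_train_val_test.py. It strips any trailing newline from the label first, so each entry ends in exactly one newline whether or not the source line had one.

```python
def format_entry(image_path, image_label):
    """Build one label-file line that ends in exactly one newline."""
    return "{}\t{}\n".format(image_path, image_label.rstrip("\r\n"))


if __name__ == "__main__":
    raw_label = "some text\n"  # label as read from a file, newline still attached

    # Original pattern: the label's own newline plus the appended one.
    doubled = "{}\t{}\n".format("img_0.jpg", raw_label)
    print(doubled.splitlines())

    # An empty line yields a one-element list, so substr[1] does not exist.
    print("".strip("\n").split("\t"))  # → ['']

    # Normalized entry: exactly one trailing newline.
    print(repr(format_entry("img_0.jpg", raw_label)))
```

Compared with simply dropping the \n, stripping first also covers a final source line that has no trailing newline, which would otherwise be glued to the next entry written.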

0 replies
