Training a recognition model with python tools/train.py -c pretrain_models/ch_PP-OCRv3_rec_distillation.yml fails with IndexError: list index out of range. Could someone help me find what is going wrong? #14963
Replies: 3 comments
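You are running python tools/train.py -c pretrain_models/ch_PP-OCRv3_rec_distillation.yml to train a text recognition model and it fails with IndexError: list index out of range. The error is raised repeatedly at line 150 of simple_dataset.py, label = substr[1], which means a line of the label file could not be parsed. The most likely cause is that your training data does not match the expected format.
Cause analysis: PaddleOCR recognition data is normally a txt file with one sample per line, where the image path and the label text are separated by a tab character (\t). The training data you posted, for example: train_data/rec/train/000012.jpg 显卡:NVIDIA® GeForce RTX™ 4060 looks correct at first glance, but the possible problems include: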
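To make the failure concrete, here is a minimal sketch of a checker for a label file. It assumes the dataset reader splits each line on a tab character, so a line without a tab yields a single field and substr[1] does not exist. The function name find_bad_lines and the sample lines are hypothetical, not part of PaddleOCR.

```python
def find_bad_lines(lines, delimiter="\t"):
    """Return (line_number, raw_line) pairs that do not parse as 'path<delimiter>label'."""
    bad = []
    for number, raw in enumerate(lines, start=1):
        fields = raw.rstrip("\r\n").split(delimiter)
        # A usable line has exactly two fields, and neither of them is empty.
        if len(fields) != 2 or not fields[0].strip() or not fields[1].strip():
            bad.append((number, raw))
    return bad


# Hypothetical sample lines for illustration only.
sample = [
    "train_data/rec/train/000013.jpg\t内存容量:32GB\n",  # tab-separated: parses
    "train_data/rec/train/000013.jpg 内存容量:32GB\n",   # space-separated: only one field
    "train_data/rec/train/000125.jpg\t\n",                # tab present, label empty
]

for number, raw in find_bad_lines(sample):
    print(f"line {number}: {raw!r}")
```

To check a real label file, pass it the lines read with open(path, encoding="utf-8").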
Suggested fixes:
Make sure it points to your new, cleaned label file. For example: Train:
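One way to produce a cleaned label file is sketched below. It assumes image paths contain no whitespace, so the first run of whitespace on each line separates the path from the label. Lines without a label are dropped. The helper name clean_label_lines is hypothetical, and the last sample line is invented to show the no-label case.

```python
def clean_label_lines(lines):
    """Rewrite 'path<whitespace>label' lines as 'path<TAB>label'; skip lines with no label."""
    cleaned = []
    for raw in lines:
        # Split once on the first run of whitespace so spaces inside the label survive.
        parts = raw.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank line, or a path with nothing after it
        path, label = parts
        cleaned.append(f"{path}\t{label}\n")
    return cleaned


sample = [
    "train_data/rec/train/000017.jpg 屏幕:16英寸 FHD+\r\n",
    "train_data/rec/train/000018.jpg 6 941812 767665\r\n",
    "train_data/rec/train/000999.jpg\r\n",  # invented line with no label
]

for line in clean_label_lines(sample):
    print(repr(line))
```

When applying this to real files, reading and writing with encoding="utf-8" keeps characters such as ® and ™ intact.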
Related issue: this problem is related to PaddleOCR issue #13826: #13826. That issue also reports an IndexError caused by a malformed label format.
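If you are running on Windows, the console's default encoding is gbk, while your label text contains Unicode characters such as ®. Suggestion: set an environment variable before training so that encoding errors do not break the log output: set PYTHONIOENCODING=utf-8, or configure logging to write UTF-8.
Summary: the IndexError: list index out of range you are seeing is most likely caused by malformed lines in the training list file, for example an empty label, a space used as the separator, or other illegal characters. Check your annotation file carefully, clean it to the format described above, and train again.
Response generated by 🤖 feifei-bot | chatgpt-4o-latest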
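The encoding half of the problem can be reproduced without PaddleOCR. The sketch below shows that the gbk codec rejects ® while UTF-8 accepts it, and then switches the console streams to UTF-8 with TextIOWrapper.reconfigure (Python 3.7+). Treat the reconfigure step as one possible alternative to PYTHONIOENCODING, not as a PaddleOCR-specific fix.

```python
import io
import sys

message = "Intel® Graphics"

# The gbk codec cannot encode U+00AE, which is the UnicodeEncodeError seen in the log.
try:
    message.encode("gbk")
    gbk_ok = True
except UnicodeEncodeError:
    gbk_ok = False

# UTF-8 can encode any Unicode text and round-trips it unchanged.
utf8_bytes = message.encode("utf-8")

# A lossy fallback: stay on gbk but escape whatever it cannot represent.
escaped = message.encode("gbk", errors="backslashreplace")

print(gbk_ok)
print(escaped)

# Switch the console streams to UTF-8 for the rest of the process.
for stream in (sys.stdout, sys.stderr):
    if isinstance(stream, io.TextIOWrapper):
        stream.reconfigure(encoding="utf-8")
```

Either approach only stops the "Logging error" noise. The IndexError itself still has to be fixed in the label file.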
🔎 Search before asking
🐛 Bug (problem description)
Part of the training set:
train_data/rec/train/000012.jpg 显卡:NVIDIA® GeForce RTX™ 4060
train_data/rec/train/000013.jpg 内存容量:32GB
train_data/rec/train/000014.jpg 处理器:Intel® Core™ Ultra 5 processor 125H
train_data/rec/train/000015.jpg 显卡:Intel® Arc™ graphics
train_data/rec/train/000016.jpg 显卡:Intel®Iris®Xe Graphics
train_data/rec/train/000017.jpg 屏幕:16英寸 FHD+
train_data/rec/train/000018.jpg 6 941812 767665
The error output:
Message: 'When parsing line train_data/rec/train/000279.jpg Intel® Graphics\r\n, error happened with msg: Traceback (most recent call last):\n File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__\n label = substr[1]\nIndexError: list index out of range\n'
Arguments: ()
[2025/03/23 08:37:23] ppocr ERROR: When parsing line train_data/rec/train/000096.jpg 显卡:Intel® Graphics
, error happened with msg: Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
--- Logging error ---
Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\python3.10.0\lib\logging\__init__.py", line 1101, in emit
stream.write(msg + self.terminator)
UnicodeEncodeError: 'gbk' codec can't encode character '\xae' in position 95: illegal multibyte sequence
Call stack:
File "D:\python3.10.0\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "D:\python3.10.0\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "D:\python3.10.0\lib\threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "D:\python3.10.0\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 217, in _thread_loop
batch = self._dataset_fetcher.fetch(indices,
File "D:\python3.10.0\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 125, in fetch
data.append(self.dataset[idx])
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
[Previous line repeated 146 more times]
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 161, in __getitem__
self.logger.error(
Message: 'When parsing line train_data/rec/train/000096.jpg 显卡:Intel® Graphics\r\n, error happened with msg: Traceback (most recent call last):\n File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__\n label = substr[1]\nIndexError: list index out of range\n'
Arguments: ()
[2025/03/23 08:37:23] ppocr ERROR: When parsing line train_data/rec/train/000125.jpg 颜色:
, error happened with msg: Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
[2025/03/23 08:37:23] ppocr ERROR: When parsing line train_data/rec/train/000032.jpg 处理器:Intel® Core™ i5-13420H Processor
, error happened with msg: Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
--- Logging error ---
Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "D:\python3.10.0\lib\logging\__init__.py", line 1101, in emit
stream.write(msg + self.terminator)
UnicodeEncodeError: 'gbk' codec can't encode character '\xae' in position 96: illegal multibyte sequence
Call stack:
File "D:\python3.10.0\lib\threading.py", line 966, in _bootstrap
self._bootstrap_inner()
File "D:\python3.10.0\lib\threading.py", line 1009, in _bootstrap_inner
self.run()
File "D:\python3.10.0\lib\threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "D:\python3.10.0\lib\site-packages\paddle\fluid\dataloader\dataloader_iter.py", line 217, in _thread_loop
batch = self._dataset_fetcher.fetch(indices,
File "D:\python3.10.0\lib\site-packages\paddle\fluid\dataloader\fetcher.py", line 125, in fetch
data.append(self.dataset[idx])
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 169, in __getitem__
return self.__getitem__(rnd_idx)
[Previous line repeated 148 more times]
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 161, in __getitem__
self.logger.error(
Message: 'When parsing line train_data/rec/train/000032.jpg 处理器:Intel® Core™ i5-13420H Processor\r\n, error happened with msg: Traceback (most recent call last):\n File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__\n label = substr[1]\nIndexError: list index out of range\n'
Arguments: ()
[2025/03/23 08:37:23] ppocr ERROR: When parsing line train_data/rec/train/000282.jpg 净重:
, error happened with msg: Traceback (most recent call last):
File "D:\PaddleOCR-release-2.7\ppocr\data\simple_dataset.py", line 150, in __getitem__
label = substr[1]
IndexError: list index out of range
🏃♂️ Environment (runtime environment)
os windows 11
environment
python 3.10.0
paddleocr 2.7
install zip
ram 32.0 GB
cpu Intel(R) Core(TM) Ultra 7 155H
🌰 Minimal Reproducible Example (smallest demo that reproduces the problem)
Part of the training set:
train_data/rec/train/000012.jpg 显卡:NVIDIA® GeForce RTX™ 4060
train_data/rec/train/000013.jpg 内存容量:32GB
train_data/rec/train/000014.jpg 处理器:Intel® Core™ Ultra 5 processor 125H
train_data/rec/train/000015.jpg 显卡:Intel® Arc™ graphics
train_data/rec/train/000016.jpg 显卡:Intel®Iris®Xe Graphics
train_data/rec/train/000017.jpg 屏幕:16英寸 FHD+
train_data/rec/train/000018.jpg 6 941812 767665