Table recognition on scanned PDFs is inaccurate. How can it be improved? #12957

PureWaterCatt · 2024-05-28T08:55:23Z

PureWaterCatt
May 28, 2024

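Problem Description

With layout analysis plus table recognition, including cropping the table out and recognizing it on its own, the data in the last row is never recognized.
During layout analysis, the table is fully enclosed by the rectangular box of the table label.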

Runtime Environment

Paddle: develop
PaddleOCR: develop
OS: ubuntu 20.04
GCC version: (Ubuntu/Linaro 8.4.0-3ubuntu2) 8.4.0
Clang version: N/A
CMake version: version 3.27.7
Libc version: glibc 2.31
Python version: 3.10.14

Reproduction Code

python predict_table.py --image_dir=../../output/screenshot-20240528-164403.png --det_model_dir=../inference/ch_PP-OCRv4_det_infer --rec_model_dir=../inference/ch_PP-OCRv4_rec_infer --rec_char_dict_path=../../ppocr/utils/ppocr_keys_v1.txt --table_model_dir=../inference/ch_ppstructure_mobile_v2.0_SLANet_infer --table_char_dict_path=../../ppocr/utils/dict/table_structure_dict_ch.txt --output=../../output/table

Complete Error Message

Possible Solutions

Appendix

test.zip

GreatV · 2024-05-28T09:10:07Z

GreatV
May 28, 2024
Maintainer

@PureWaterCatt Can you provide the original image?


PureWaterCatt · 2024-05-28T09:35:18Z

PureWaterCatt
May 28, 2024
Author

@PureWaterCatt Can you provide the original image?
@GreatV


GreatV · 2024-05-28T12:14:59Z

GreatV
May 28, 2024
Maintainer

[2024/05/28 20:11:42] ppocr DEBUG: Namespace(help='==SUPPRESS==', use_gpu=False, use_xpu=False, use_npu=False, use_mlu=False, ir_optim=True, use_tensorrt=False, min_subgraph_size=15, precision='fp32', gpu_mem=500, gpu_id=0, image_dir=None, page_num=0, det_algorithm='DB', det_model_dir='/Users/wangxin/.paddleocr/whl/det/ch/ch_PP-OCRv4_det_infer', det_limit_side_len=960, det_limit_type='max', det_box_type='quad', det_db_thresh=0.3, det_db_box_thresh=0.6, det_db_unclip_ratio=1.5, max_batch_size=10, use_dilation=False, det_db_score_mode='fast', det_east_score_thresh=0.8, det_east_cover_thresh=0.1, det_east_nms_thresh=0.2, det_sast_score_thresh=0.5, det_sast_nms_thresh=0.2, det_pse_thresh=0, det_pse_box_thresh=0.85, det_pse_min_area=16, det_pse_scale=1, scales=[8, 16, 32], alpha=1.0, beta=1.0, fourier_degree=5, rec_algorithm='SVTR_LCNet', rec_model_dir='/Users/wangxin/.paddleocr/whl/rec/ch/ch_PP-OCRv4_rec_infer', rec_image_inverse=True, rec_image_shape='3, 48, 320', rec_batch_num=6, max_text_length=25, rec_char_dict_path='/Users/wangxin/repos/PaddleOCR/ppocr/utils/ppocr_keys_v1.txt', use_space_char=True, vis_font_path='./doc/fonts/simfang.ttf', drop_score=0.5, e2e_algorithm='PGNet', e2e_model_dir=None, e2e_limit_side_len=768, e2e_limit_type='max', e2e_pgnet_score_thresh=0.5, e2e_char_dict_path='./ppocr/utils/ic15_dict.txt', e2e_pgnet_valid_set='totaltext', e2e_pgnet_mode='fast', use_angle_cls=False, cls_model_dir=None, cls_image_shape='3, 48, 192', label_list=['0', '180'], cls_batch_num=6, cls_thresh=0.9, enable_mkldnn=False, cpu_threads=10, use_pdserving=False, warmup=False, sr_model_dir=None, sr_image_shape='3, 32, 128', sr_batch_num=1, draw_img_save_dir='./inference_results', save_crop_res=False, crop_res_save_dir='./output', use_mp=False, total_process_num=1, process_id=0, benchmark=False, save_log_path='./log_output/', show_log=True, use_onnx=False, return_word_box=False, output='./output', table_max_len=488, table_algorithm='TableAttn', 
table_model_dir='/Users/wangxin/.paddleocr/whl/table/ch_ppstructure_mobile_v2.0_SLANet_infer', merge_no_span_structure=True, table_char_dict_path='/Users/wangxin/repos/PaddleOCR/ppocr/utils/dict/table_structure_dict_ch.txt', layout_model_dir='/Users/wangxin/.paddleocr/whl/layout/picodet_lcnet_x1_0_fgd_layout_cdla_infer', layout_dict_path='/Users/wangxin/repos/PaddleOCR/ppocr/utils/dict/layout_dict/layout_cdla_dict.txt', layout_score_threshold=0.5, layout_nms_threshold=0.5, kie_algorithm='LayoutXLM', ser_model_dir=None, re_model_dir=None, use_visual_backbone=True, ser_dict_path='../train_data/XFUND/class_list_xfun.txt', ocr_order_method=None, mode='structure', image_orientation=True, layout=True, table=True, ocr=True, recovery=False, use_pdf2docx_api=False, invert=False, binarize=False, alphacolor=(255, 255, 255), lang='ch', det=True, rec=True, type='ocr', savefile=False, ocr_version='PP-OCRv4', structure_version='PP-StructureV2')
[2024/05/28 20:11:42] ppcls WARNING: The current running environment does not support the use of GPU. CPU has been used instead.
[2024/05/28 20:11:44] ppocr DEBUG: dt_boxes num : 43, elapsed : 0.09825301170349121
[2024/05/28 20:11:47] ppocr DEBUG: rec_res num  : 43, elapsed : 3.3741910457611084
[2024/05/28 20:11:48] ppocr DEBUG: dt_boxes num : 43, elapse : 0.09876203536987305
[2024/05/28 20:11:51] ppocr DEBUG: rec_res num  : 43, elapse : 3.1829168796539307
{'type': 'table', 'bbox': [0, 0, 458, 361], 'res': {'cell_bbox': [[5.933736324310303, 3.846806526184082, 151.30047607421875, 4.098109722137451, 151.35910034179688, 90.56180572509766, 5.676283836364746, 90.94571685791016], [147.81179809570312, 3.6569559574127197, 225.90658569335938, 3.77463698387146, 227.0962371826172, 79.76953125, 147.03846740722656, 80.49063110351562], [233.22488403320312, 3.6309869289398193, 284.1729736328125, 3.738816022872925, 285.87506103515625, 77.31891632080078, 234.9944610595703, 77.89393615722656], [279.1354675292969, 4.3987555503845215, 321.6705017089844, 4.548232555389404, 323.4425048828125, 78.41556549072266, 280.9509582519531, 78.96294403076172], [335.737060546875, 4.784805774688721, 362.3387145996094, 5.019000053405762, 362.94622802734375, 77.28314971923828, 336.6397705078125, 77.39521026611328], [388.3001708984375, 5.19167423248291, 452.4793701171875, 5.577062129974365, 452.38336181640625, 90.18814086914062, 387.8672790527344, 88.91836547851562], [6.535987377166748, 70.59809112548828, 156.50340270996094, 71.90779876708984, 156.2332000732422, 170.76112365722656, 6.474782943725586, 168.9591064453125], [143.13043212890625, 73.70213317871094, 225.41241455078125, 74.99711608886719, 225.60519409179688, 172.0007781982422, 143.0009002685547, 170.54078674316406], [234.8168487548828, 75.92626953125, 285.4459228515625, 77.37468719482422, 285.9234619140625, 169.67877197265625, 234.77297973632812, 168.8255157470703], [279.1575622558594, 80.0408935546875, 321.14984130859375, 81.3666000366211, 322.0708923339844, 170.91236877441406, 279.68658447265625, 170.24501037597656], [327.3981628417969, 81.63390350341797, 368.4235534667969, 83.271728515625, 368.04571533203125, 183.8792266845703, 326.7834777832031, 182.89442443847656], [391.68316650390625, 77.49728393554688, 450.87774658203125, 80.59970092773438, 450.4789123535156, 351.57135009765625, 389.3020935058594, 351.1860656738281], [18.0737361907959, 164.57754516601562, 143.79637145996094, 
164.93113708496094, 144.43597412109375, 223.82362365722656, 17.937854766845703, 224.24269104003906], [143.6466522216797, 168.29611206054688, 212.39947509765625, 169.4072723388672, 213.32969665527344, 219.197021484375, 143.596923828125, 219.14364624023438], [227.44757080078125, 167.58865356445312, 278.9219970703125, 169.0375213623047, 279.76116943359375, 218.89549255371094, 227.97239685058594, 218.91354370117188], [273.3187561035156, 169.4761962890625, 319.05877685546875, 171.0519256591797, 320.08544921875, 221.45974731445312, 273.823974609375, 221.37847900390625], [316.6603088378906, 168.15016174316406, 356.101806640625, 169.28562927246094, 355.83355712890625, 223.83612060546875, 315.95855712890625, 223.8933563232422], [11.54481315612793, 209.0058135986328, 149.67161560058594, 210.10269165039062, 150.80596923828125, 313.8858642578125, 11.31093692779541, 313.5046691894531], [142.2597198486328, 220.2042999267578, 210.1415252685547, 220.74473571777344, 208.92703247070312, 306.9714660644531, 140.53517150878906, 306.3691101074219], [230.01699829101562, 218.03366088867188, 278.3423767089844, 218.091064453125, 278.7393493652344, 309.50640869140625, 229.2131805419922, 309.4779052734375], [279.83758544921875, 218.29098510742188, 322.95928955078125, 217.55064392089844, 322.50897216796875, 312.3780822753906, 278.6295471191406, 312.7047119140625], [335.04815673828125, 219.59906005859375, 366.26019287109375, 219.55929565429688, 364.4071044921875, 315.0625305175781, 332.7979736328125, 315.12701416015625]], 'html': '<html><body><table><tr><td>岗位职级名称</td><td>职级 号码</td><td>二 类城市</td><td>三类 城市</td><td>香港</td><td>国(境)外</td></tr><tr><td>总经理、副总经理、 高级专业经理、高级 销售经理、专业总监</td><td>M3及以上、 P5及以上</td><td>600</td><td>500</td><td>1500</td><td rowspan="3">标准间及 参照财行 [2013] 516号、 财行 [2017]43 4号住宿 费标准执 行</td></tr><tr><td>部门经理</td><td>M2</td><td>400</td><td>350</td><td>1100</td></tr><tr><td>专业经理、高级技师 (主任)、销售经理、 助理专业经理、技 师、助理销售经理 其余人员S1、S2、S3、P1、P2</td><td>P3、P4 S4、S5</td><td>350 
300</td><td>300</td><td>1100 2601100</td></tr></table></body></html>'}, 'img_idx': 0}
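To see how far down the predicted cells reach, the `cell_bbox` entries in the log above can be reduced to axis-aligned rectangles. The following is only a sketch: it assumes each entry is an eight-value polygon laid out as `x1, y1, x2, y2, x3, y3, x4, y4`, which is how the values in the log appear to be ordered. The helper names `polygon_to_rect` and `lowest_cell_edge` are hypothetical and not part of PaddleOCR.

```python
def polygon_to_rect(poly):
    """Convert an 8-value polygon [x1, y1, ..., x4, y4] to (xmin, ymin, xmax, ymax).

    Assumption: x and y values are interleaved, as they appear in the log.
    """
    xs = poly[0::2]  # every even index is an x coordinate
    ys = poly[1::2]  # every odd index is a y coordinate
    return min(xs), min(ys), max(xs), max(ys)


def lowest_cell_edge(cell_bboxes):
    """Return the largest y value reached by any predicted cell."""
    return max(polygon_to_rect(poly)[3] for poly in cell_bboxes)
```

Comparing `lowest_cell_edge(res['cell_bbox'])` with the height of the region's `bbox` gives a rough indication of whether any predicted cell covers the bottom of the table.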

This has been fixed on the main branch and will be released in version 2.8.0.


PureWaterCatt · 2024-05-29T11:29:09Z

PureWaterCatt
May 29, 2024
Author

@GreatV Has the fix already landed on the main branch? I pulled the latest code and tested it, and nothing seems to have changed.


GreatV · 2024-05-29T11:37:24Z

GreatV
May 29, 2024
Maintainer

This is how I tested it:

import os
import cv2
from paddleocr import PPStructure,draw_structure_result,save_structure_res
from PIL import Image

table_engine = PPStructure(show_log=True, image_orientation=True)

save_folder = './output'
img_path = './334352202-c065007d-96b3-4a30-ab4d-42be47ec3ee8.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

font_path = 'doc/fonts/simfang.ttf'  # font file shipped with the PaddleOCR repo
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result,font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')


PureWaterCatt · 2024-05-29T11:49:55Z

PureWaterCatt
May 29, 2024
Author

The result seems to be the same as before: the cells in the last row are malformed. There should be 5 rows.
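One way to check the row count without reading the HTML by eye is to count the `<tr>` elements in the `html` field of the table result. This is a minimal sketch using only the Python standard library. The class `TableShape` and the function `table_shape` are hypothetical names introduced here, not PaddleOCR APIs.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Collect the number of cells in each <tr> of an HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []  # one entry per <tr>, holding its cell count

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append(0)
        elif tag in ("td", "th") and self.rows:
            self.rows[-1] += 1


def table_shape(html):
    """Return a list with the number of <td>/<th> cells in each row."""
    parser = TableShape()
    parser.feed(html)
    return parser.rows


example = ('<html><body><table><tr><td>a</td><td>b</td></tr>'
           '<tr><td rowspan="2">c</td></tr></table></body></html>')
print(table_shape(example))
```

Applied to the `html` string posted earlier in the thread, this kind of count would show four `<tr>` rows rather than the expected five. Cells covered by a `rowspan` are not repeated in later rows, so per-row counts can legitimately differ.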


GreatV · 2024-05-29T12:00:21Z

GreatV
May 29, 2024
Maintainer

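Right, the table structure was not recognized correctly; I was only checking that the text was recognized successfully. Padding the image outward a bit might improve the result.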
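The padding idea could be tried roughly as follows. This is a sketch, not a confirmed fix: the border width of 50 pixels and the white fill value are assumptions chosen for a scanned page with a light background, and `pad_image` is a hypothetical helper. It uses NumPy only; `cv2.copyMakeBorder` with `cv2.BORDER_CONSTANT` would be an equivalent way to do it.

```python
import numpy as np


def pad_image(img, pad=50, value=255):
    """Return a copy of `img` with a constant border of `pad` pixels on every side.

    Works for grayscale (H, W) and colour (H, W, C) arrays. `pad` and
    `value` are illustrative defaults, not tuned settings.
    """
    if img.ndim == 2:
        widths = ((pad, pad), (pad, pad))
    else:
        # Pad height and width only; leave the channel axis untouched.
        widths = ((pad, pad), (pad, pad), (0, 0))
    return np.pad(img, widths, mode="constant", constant_values=value)
```

The padded array can then be passed to the engine in place of the original, for example `table_engine(pad_image(img))`. Any coordinates in the result would be offset by `pad` pixels relative to the unpadded image.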

