Can I get tables and text from a PDF separately with PaddleOCR? #12276

jay22mehta · 2024-04-17T11:27:25Z


Currently I'm using the following code to extract tables as Excel files (taken from 2.2.4, table recognition):

import os
import cv2
import PIL
import paddleclas
import paddle
from paddleocr import PPStructure, draw_structure_result, save_structure_res

# Table recognition only: layout analysis is switched off
table_engine = PPStructure(layout=False, show_log=True)

save_folder = 'output'
img_path = 'example.png'
img = cv2.imread(img_path)
result = table_engine(img)
# Write the results into a subfolder of save_folder named after the image
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])

To get the text from a PDF file I'm using the code below, but it also converts the tables into text, which I don't want.

from paddleocr import PaddleOCR, draw_ocr

# PaddleOCR supports Chinese, English, French, German, Korean and Japanese.
# Set the parameter `lang` to `ch`, `en`, `fr`, `german`, `korean` or `japan`
# to switch the language model.
ocr = PaddleOCR(use_angle_cls=True, lang="en", page_num=0)  # need to run only once to download and load model into memory
img_path = 'tables/example.pdf'
result = ocr.ocr(img_path, cls=True)

def ocr_to_txt(result):
    text = ""
    for line in result:  # for a PDF, one entry per page
        if line is None:  # a page without detected text may come back as None
            continue
        for word in line:  # each entry is [box, (text, score)]
            text += word[1][0] + " "
        text += "\n"
    return text

text = ocr_to_txt(result)

with open("ocr_results.txt", "w") as f:
    f.write(text)

Sunting78 · 2024-04-18T03:09:09Z

Collaborator

You can use layout analysis to distinguish text regions from table regions. Refer to https://github.com/PaddlePaddle/PaddleOCR/blob/2b3b3554c05ae615ed7eb051c2ac7c6bb8bc985d/ppstructure/README.md
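
As a rough illustration of that suggestion, here is a minimal sketch of splitting layout-analysis output into plain text and tables. It assumes the result format of the PaddleOCR 2.x `PPStructure` engine as I understand it: a list of region dicts with a `type` label and a `res` payload, where table regions carry `{'html': ...}` and other regions carry a list of `{'text': ...}` dicts. The helper names `split_regions` and `analyze_image` are hypothetical, not part of PaddleOCR, and the format should be checked against the installed version.

```python
from typing import Any, Dict, List, Tuple


def split_regions(regions: List[Dict[str, Any]]) -> Tuple[str, List[str]]:
    """Separate layout-analysis regions into plain text and table HTML.

    Assumption: every region is a dict with a 'type' label and a 'res'
    payload. Table regions hold {'html': ...}; all other regions hold a
    list of {'text': ...} dicts.
    """
    text_lines: List[str] = []
    tables: List[str] = []
    for region in regions:
        kind = str(region.get("type", "")).lower()  # labels vary in case
        res = region.get("res")
        if kind == "table":
            if isinstance(res, dict) and "html" in res:
                tables.append(res["html"])
            continue  # table content never reaches the plain text
        for item in res or []:
            if isinstance(item, dict) and item.get("text"):
                text_lines.append(item["text"])
    return "\n".join(text_lines), tables


def analyze_image(img_path: str) -> Tuple[str, List[str]]:
    """Run PPStructure with layout analysis on one page image (hypothetical helper)."""
    # Imported lazily so split_regions can be used without these packages.
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)  # layout analysis is left enabled
    return split_regions(engine(cv2.imread(img_path)))
```

Calling `analyze_image('example.png')` would then return the non-table text as one string and each table as an HTML string, so the two can be saved separately. The sketch takes a page image, as in the table example above; a PDF page would probably need to be rendered to an image first.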
