Can I get tables and text from a PDF separately with PaddleOCR? #12276
Unanswered
jay22mehta
asked this question in
Q&A
Replies: 1 comment
-
You can use layout analysis to tell text regions apart from table regions. Refer to https://github.com/PaddlePaddle/PaddleOCR/blob/2b3b3554c05ae615ed7eb051c2ac7c6bb8bc985d/ppstructure/README.md
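To make that suggestion concrete, here is a minimal sketch of splitting a layout-analysis result into plain text and tables. It assumes the PaddleOCR 2.x result format, where each region is a dict with a `type` and a `res` key, table regions carry `res["html"]`, and other regions carry a list of dicts with a `text` key. Older releases may return a different shape, so check it against your installed version. The helper names `split_text_and_tables` and `extract_from_image` are hypothetical, not part of PaddleOCR.

```python
from typing import Any, Dict, List, Tuple


def split_text_and_tables(regions: List[Dict[str, Any]]) -> Tuple[str, List[str]]:
    """Split layout regions into plain text and a list of table HTML strings.

    Assumes each region looks like {"type": ..., "res": ...} (PaddleOCR 2.x).
    """
    text_lines: List[str] = []
    tables: List[str] = []
    for region in regions:
        kind = str(region.get("type", "")).lower()
        res = region.get("res")
        if kind == "table":
            # Assumed: table regions store the recognized table as HTML.
            if isinstance(res, dict) and "html" in res:
                tables.append(res["html"])
        elif isinstance(res, list):
            # Assumed: non-table regions store one dict per recognized line.
            for item in res:
                if isinstance(item, dict) and "text" in item:
                    text_lines.append(item["text"])
    return "\n".join(text_lines), tables


def extract_from_image(img_path: str) -> Tuple[str, List[str]]:
    """Run PP-Structure with layout analysis on one page image."""
    # Imported lazily so the helper above works without PaddleOCR installed.
    import cv2
    from paddleocr import PPStructure

    engine = PPStructure(show_log=False)  # layout analysis is enabled by default
    img = cv2.imread(img_path)
    return split_text_and_tables(engine(img))
```

Calling `extract_from_image("example.png")` would then return the page text without the table contents, plus the tables as HTML. The engine is called on an image array here, so a PDF would probably need to be rendered to page images first.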
-
Currently, to get tables, I'm using this code (from section 2.2.4, table recognition), which saves them as an Excel file:
import os
import cv2
import PIL
import paddleclas
import paddle
from paddleocr import PPStructure,draw_structure_result,save_structure_res
table_engine = PPStructure(layout=False, show_log=True) # table recognition
save_folder = 'output'
img_path = 'example.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder, os.path.basename(img_path).split('.')[0])
And to get text from the PDF file I'm using the code below, but with it the tables are also converted into text, which I don't want:
from paddleocr import PaddleOCR, draw_ocr
# PaddleOCR supports Chinese, English, French, German, Korean and Japanese.
# You can set the parameter `lang` as `ch`, `en`, `fr`, `german`, `korean`, `japan`
# to switch the language model in order.
ocr = PaddleOCR(use_angle_cls=True, lang="en", page_num=0)  # need to run only once to download and load model into memory
img_path = 'tables/example.pdf'
result = ocr.ocr(img_path, cls=True)
def ocr_to_txt(result):
    text = ""
    for line in result:
        for word in line:
            text += word[1][0] + " "
        text += "\n"
    return text
text = ocr_to_txt(result)
with open("ocr_results.txt", "w") as f:
    f.write(text)