ppStructure: error when ocr_engine() is called concurrently #14528

IeohMingChan · 2025-01-08T09:55:27Z

🔎 Search before asking

I have searched the PaddleOCR Docs and found no similar bug report.
I have searched the PaddleOCR Issues and found no similar bug report.
I have searched the PaddleOCR Discussions and found no similar bug report.

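🐛 Bug (Problem description)

I use a FastAPI endpoint that receives the path of a PDF file and then calls ocr_engine to parse it. If a second PDF arrives while the first is still being parsed, and the same PPStructure instance is used to parse it, an error is raised after the two parsing tasks have run in parallel for a while. The error is: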

ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/routing.py", line 715, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/routing.py", line 735, in app
    await route.handle(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/routing.py", line 288, in handle
    await self.app(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/routing.py", line 76, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/_exception_handler.py", line 62, in wrapped_app
    raise exc
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/_exception_handler.py", line 51, in wrapped_app
    await app(scope, receive, sender)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/routing.py", line 73, in app
    response = await f(request)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/fastapi/routing.py", line 301, in app
    raw_response = await run_endpoint_function(
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/fastapi/routing.py", line 214, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/anyio/to_thread.py", line 56, in run_sync
    return await get_async_backend().run_sync_in_worker_thread(
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 2405, in run_sync_in_worker_thread
    return await future
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 914, in run
    result = context.run(func, *args)
  File "/home/chenlm/srdkb-loader/loaders/pdf_loader/ocr_service.py", line 422, in ocr
    dict_list, minioClinet = ocr_service.process_file(file_path, isMinio)
  File "/home/chenlm/srdkb-loader/loaders/pdf_loader/ocr_service.py", line 52, in process_file
    result = self.ocr_engine(file_path)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/paddleocr/paddleocr.py", line 850, in __call__
    res, _ = super().__call__(
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 135, in __call__
    text_res, ocr_time_dict = self._predict_text(img)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/paddleocr/ppstructure/predict_system.py", line 189, in _predict_text
    filter_boxes, filter_rec_res, ocr_time_dict = self.text_system(img)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/paddleocr/tools/infer/predict_system.py", line 144, in __call__
    rec_res, elapse = self.text_recognizer(img_crop_list)
  File "/data/local/anaconda3/envs/loader/lib/python3.10/site-packages/paddleocr/tools/infer/predict_rec.py", line 698, in __call__
    rec_res[indices[beg_img_no + rno]] = rec_result[rno]
IndexError: index 58 is out of bounds for axis 0 with size 58

The value 58 is not fixed. The relevant code and screenshots of the error are below. Thanks for your help.
![image](https://github.com/user-attachments/assets/26d220a0-7465-4391-bf78-f5d242267e5b)
![image](https://github.com/user-attachments/assets/fc65c35b-d372-422a-9bfd-dabdf8013f13)

🏃‍♂️ Environment

paddleocr 2.8.1
paddlepaddle-gpu 2.6.2
python 3.10
Ubuntu 22.04.5 LTS

🌰 Minimal Reproducible Example

import io
import json
import uuid
import re
import time
from fastapi import FastAPI, Form, BackgroundTasks, APIRouter
from fastapi.responses import JSONResponse
from io import BytesIO
from paddleocr import PPStructure
from loguru import logger
from ocr_config import *
app = FastAPI()

from typing import List, Dict
import pandas as pd
from bs4 import BeautifulSoup
import httpx
class OCRService:
    def __init__(self):

        self.return_table = True
        self.CATALOG_DETECTION_THRESHOLD = 3
        self.ocr_engine = PPStructure(table=self.return_table, ocr=True, show_log=False, structure_version="PP-StructureV2")

    def process_file(self, file_path: str):

        result = self.ocr_engine(file_path)
        return result


ocr_service = OCRService()

@app.post("/ocr")
def ocr(file_path: str = Form(...), isMinio: bool = Form(...)):
    try:

        result = ocr_service.process_file(file_path)
        return JSONResponse(content={"status": "success", "dict_list": result}, status_code=200)


    except Exception as e:
        logger.error(f"OCR processing failed: {e}")
        return JSONResponse(content={"status": "error", "message": str(e)}, status_code=500)


if __name__ == "__main__":
    import uvicorn
    uvicorn.run('ocr_service:app', host="0.0.0.0", port=1300, workers = 1)

jingsongliujing (Collaborator) · 2025-01-08T10:13:00Z

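In effect you have started only one model instance, and PaddleOCR inference runs with a batch size of 1. If the previous file has not finished processing, an exception is bound to be thrown. I suggest adding a queue to solve this.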
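
One way to read the queue suggestion, as a minimal sketch: route every call through a single worker thread, so that requests wait in line instead of entering the engine at the same time. `SerializedEngine` is a hypothetical wrapper and not part of PaddleOCR; `engine` stands for any callable, such as the shared PPStructure instance from the example above.

```python
from concurrent.futures import ThreadPoolExecutor


class SerializedEngine:
    """Run every call to `engine` on one worker thread, in FIFO order.

    Hypothetical helper: `engine` is any callable, for example a shared
    PPStructure instance. Nothing here is PaddleOCR API.
    """

    def __init__(self, engine):
        self._engine = engine
        # A single worker means queued jobs can never overlap.
        self._executor = ThreadPoolExecutor(max_workers=1)

    def __call__(self, *args, **kwargs):
        # submit() enqueues the job; result() blocks the calling thread
        # until the job has run and re-raises any exception it raised.
        future = self._executor.submit(self._engine, *args, **kwargs)
        return future.result()

    def close(self):
        # Wait for queued jobs to finish, then stop the worker thread.
        self._executor.shutdown(wait=True)
```

In the reporter's service this would mean wrapping the engine once at startup, for example `self.ocr_engine = SerializedEngine(self.ocr_engine)`. The second request then waits until the first file has finished, so throughput stays at one file at a time.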


GreatV (Maintainer) · 2025-01-10T05:47:57Z

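Based on your problem description and the error log, this looks like a problem caused by concurrent calls to ppStructure. Here is an analysis and some possible solutions:

Cause:

Calling the same ppStructure instance to process several PDFs at the same time leads to concurrency problems.
The error shows up in the table recognition module:

InvalidArgumentError: The shape of input[0] and input[1] is expected to be equal.But received input[0]'s shape = [], input[1]'s shape = [1].

This indicates that the model's input and output dimensions no longer match during concurrent processing.

Solutions:

Solution 1: Use multiple processes instead of multiple threads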

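from multiprocessing import Pool

from paddleocr import PPStructure

def process_pdf(file_path):
    # Build the engine inside the worker process so no instance is shared
    ocr_engine = PPStructure(...)
    result = ocr_engine(file_path)
    return result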

def parallel_process_pdfs(file_paths):
    with Pool() as pool:
        results = pool.map(process_pdf, file_paths)
    return results

Solution 2: Create a new ppStructure instance for each request

@app.post("/ocr")
def ocr(file_path: str = Form(...)):
    try:
        # Create a new instance on every request
        ocr_engine = PPStructure(table=True, ocr=True, show_log=False)
        result = ocr_engine(file_path)
        return JSONResponse(content={"status": "success", "dict_list": result})
    except Exception as e:
        logger.error(f"OCR processing failed: {e}")
        return JSONResponse(content={"status": "error", "message": str(e)})

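Recommendations:

ppStructure's internal implementation is not thread-safe, so avoid sharing one instance across threads.
If you need concurrent processing, use multiple processes or create a separate ppStructure instance for each request.
Consider a process pool to manage several ppStructure instances and make better use of resources.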
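
As a thread-based variation on the pooling idea (not the process pool itself), the sketch below keeps a fixed number of engines and lends each one to a single request at a time. `EnginePool` and `factory` are hypothetical names. The sketch assumes that separate PPStructure instances can run in different threads of one process without interfering with each other, which this thread does not confirm. If that assumption does not hold, separate processes remain the safer route.

```python
import queue
from contextlib import contextmanager


class EnginePool:
    """Hold a fixed number of engines; lend each to one thread at a time.

    Hypothetical helper: `factory` is any zero-argument callable that
    builds an engine, for example a function returning a new PPStructure.
    """

    def __init__(self, factory, size=2):
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def borrow(self):
        engine = self._idle.get()  # blocks while every engine is busy
        try:
            yield engine
        finally:
            self._idle.put(engine)  # hand it back even if the call failed

    def __call__(self, *args, **kwargs):
        with self.borrow() as engine:
            return engine(*args, **kwargs)
```

Compared with building a new instance on every request, the models are loaded only `size` times, and no more than `size` files are processed at once.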

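Additional notes:

This problem is unrelated to the FLAGS_enable_pir_api setting.
It is mainly a thread-safety problem caused by concurrent access to the ppStructure instance.
A similar problem appears in issue #14254.

I suggest adopting either of the solutions above to avoid problems caused by concurrent access. If your use case is performance-sensitive, consider the process pool approach.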

Response generated by feifei-bot | claude-3-5-sonnet-latest
