@timif2 Good to see this question coming up 😃.

There are several things you can do to improve performance, depending on your use case. The pipeline stages, ordered from most to least expensive, are OCR, table structure recognition, and PDF parsing. My recommendations are:

  1. Turn off OCR if you don't need it for your data (e.g. you only bring digital PDFs, with no scanned pages)
  2. Turn off table structure recognition if you don't need table structure (e.g. your PDFs have no tables or you don't need the tables' content)
    • only possible in Python API code, see below.
  3. Switch the PDF backend to DoclingParseV2DocumentBackend (beta), which speeds up PDF loading by ~10x, with good impact o…
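The three recommendations above can be combined in one converter setup. This is a minimal sketch, not the snippet from the original answer: the helper name `build_fast_pdf_converter` and its parameters are hypothetical, and the import paths and option names (`do_ocr`, `do_table_structure`, `docling_parse_v2_backend`) are assumed from docling 2.x releases, so check them against your installed version.

```python
def build_fast_pdf_converter(need_ocr=False, need_tables=False, use_v2_backend=True):
    """Build a docling DocumentConverter with the expensive PDF stages switched off.

    Hypothetical helper for illustration. The imports sit inside the function
    so the module paths (assumed from docling 2.x) are only resolved on call.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # Recommendation 1: skip OCR unless the PDFs contain scanned pages.
    pipeline_options.do_ocr = need_ocr
    # Recommendation 2: skip table structure recognition unless table content matters.
    pipeline_options.do_table_structure = need_tables

    format_kwargs = {"pipeline_options": pipeline_options}
    if use_v2_backend:
        # Recommendation 3: the beta V2 parsing backend (assumed module path).
        from docling.backend.docling_parse_v2_backend import (
            DoclingParseV2DocumentBackend,
        )
        format_kwargs["backend"] = DoclingParseV2DocumentBackend

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**format_kwargs)}
    )
```

With docling installed, usage would look like `converter = build_fast_pdf_converter()` followed by `result = converter.convert("report.pdf")`, where `report.pdf` is a placeholder path. Passing `need_tables=True` re-enables table structure recognition while keeping OCR off.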

Replies: 4 comments 18 replies

Answer selected by timif2