Dial RAG fails to extract text from certain PDF files #94

@Allob

Name and Version

ai-dial-rag 0.38.0

What steps will reproduce the bug?

Dial RAG fails to extract text from certain PDF files when parsing them with unstructured.

Unstructured's partition_pdf_or_image hits the exception "TypeError: unsupported format string passed to list.__format__" raised from pdfminer and returns empty content for the document.
The document can still be processed by the visual retrieval pipeline, but the text pipeline receives empty text.
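
For context, here is a minimal sketch of how this kind of TypeError arises in plain Python. It is not the pdfminer code path (that location is not identified in this issue); `format_value` and `try_format` are hypothetical helpers used only to show that applying a numeric format spec to a list produces the same message.

```python
def format_value(value):
    """Format a value with a fixed-precision spec (hypothetical helper)."""
    return f"{value:.2f}"


def try_format(value):
    """Return the formatted value, or the error text if formatting fails."""
    try:
        return format_value(value)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A float accepts the ".2f" format spec.
print(try_format(1.5))
# A list does not implement format specs, so list.__format__ rejects it.
print(try_format([1.5, 2.0]))
```

The second call reports "unsupported format string passed to list.__format__", which suggests that somewhere a list reaches a formatting call that expects a single number.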

The issue appears to affect Dial RAG versions starting from 0.34.0, after the pdfplumber and pdfminer.six update in this PR: https://github.com/epam/ai-dial-rag/pull/30/changes
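
To help confirm which versions are involved in a given environment, a small sketch like the following could be used. It relies only on the standard library's `importlib.metadata`; the helper name `installed_versions` and the choice of packages to check are assumptions for illustration, not part of Dial RAG.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


# Distribution names as published on PyPI (assumed relevant to this issue).
for name, ver in installed_versions(
    ["pdfminer.six", "pdfplumber", "unstructured"]
).items():
    print(f"{name}: {ver or 'not installed'}")
```

Comparing this output between a Dial RAG release before 0.34.0 and an affected one may help narrow down which dependency update introduced the failure.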

Labels

bug (Something isn't working)
