Closed
Description
Description of the bug
I am working with a PDF export of the paper "PaSa: An LLM Agent for Comprehensive Academic Paper Search" from https://arxiv.org/abs/2501.10120.
PyMuPDF is not extracting the figures as expected:
- The PaSa icon on page 1 is not extracted at all.
- Figure 1 on page 2 is read in as many small, individual images rather than as a single figure.
Can PyMuPDF better support figures from academic papers?
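As a possible stopgap for the fragmented Figure 1, the bounding boxes of the small images could be merged into one region per figure. The sketch below is only an illustration of that idea, not PyMuPDF functionality: `merge_bboxes` and its `gap` tolerance are hypothetical names, and the input is assumed to be `(x0, y0, x1, y1)` tuples like the `bbox` entries returned by `page.get_image_info()`.

```python
from typing import Iterable

# (x0, y0, x1, y1), assumed to match the "bbox" entries of get_image_info()
BBox = tuple[float, float, float, float]


def _near(a: BBox, b: BBox, gap: float) -> bool:
    """True if the two boxes overlap or lie within `gap` points of each other."""
    return (
        a[0] - gap <= b[2]
        and b[0] - gap <= a[2]
        and a[1] - gap <= b[3]
        and b[1] - gap <= a[3]
    )


def _union(a: BBox, b: BBox) -> BBox:
    """Smallest box covering both inputs."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_bboxes(bboxes: Iterable[BBox], gap: float = 5.0) -> list[BBox]:
    """Hypothetical helper: merge boxes that touch or nearly touch."""
    merged: list[BBox] = []
    for box in bboxes:
        box = tuple(box)
        # Keep absorbing existing clusters until the growing box is isolated.
        changed = True
        while changed:
            changed = False
            for other in merged:
                if _near(box, other, gap):
                    merged.remove(other)
                    box = _union(box, other)
                    changed = True
                    break
        merged.append(box)
    return merged
```

Each merged rectangle could then be rendered as one image with `page.get_pixmap(clip=...)`. This would not help with the page 1 icon, which `get_image_info()` does not report at all.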
How to reproduce the bug
With Python 3.13.2, pymupdf==1.26.1, and pydantic==2.11.7:
import pathlib

import pymupdf
from pydantic import BaseModel, Field, JsonValue

THIS_DIR = pathlib.Path(__file__).parent


class ParsedImage(BaseModel):
    """Raw image parsed from a document's page."""

    index: int = Field(description="Index of the image in a given page.")
    data: bytes = Field(
        description="Raw image, ideally directly savable to an image file."
    )
    info: dict[str, JsonValue | tuple[float, ...] | bytes] = Field(
        default_factory=dict, description="Optional image metadata."
    )


content: dict[str, tuple[str, list[ParsedImage]]] = {}
with pymupdf.open(THIS_DIR / "pasa.pdf") as file:
    for i in range(file.page_count):
        page = file.load_page(i)
        content[str(i + 1)] = page.get_text("text", sort=True), [
            ParsedImage(
                index=img_index,
                data=file.extract_image(img_info["xref"])["image"],
                info=img_info,
            )
            for img_index, img_info in enumerate(
                # Extract images all at once using get_image_info()
                page.get_image_info(hashes=True, xrefs=True)
            )
        ]

assert content["1"][1], "Expected image on page 1 to be present"
assert len(content["2"][1]) < 5, "Expected figure 1 to be read-in cohesively"
PyMuPDF version
1.26.1
Operating system
MacOS
Python version
3.13