Unable to extract entire figure from PDF academic paper #4583

@jamesbraza

Description of the bug

I am working with a PDF export of the paper "PaSa: An LLM Agent for Comprehensive Academic Paper Search" from https://arxiv.org/abs/2501.10120.

PyMuPDF is failing to extract two images correctly:

  1. The PaSa icon on page 1 is not extracted at all
  2. Figure 1 on page 2 is read in as many small, individual images instead of one figure

Can PyMuPDF better support extracting figures from academic papers?

How to reproduce the bug

With Python 3.13.2, pymupdf==1.26.1, and pydantic==2.11.7:

import pathlib

import pymupdf
from pydantic import BaseModel, Field, JsonValue

THIS_DIR = pathlib.Path(__file__).parent


class ParsedImage(BaseModel):
    """Raw image parsed from a document's page."""

    index: int = Field(description="Index of the image in a given page.")
    data: bytes = Field(
        description="Raw image, ideally directly savable to an image file."
    )
    info: dict[str, JsonValue | tuple[float, ...] | bytes] = Field(
        default_factory=dict, description="Optional image metadata."
    )


content: dict[str, tuple[str, list[ParsedImage]]] = {}
with pymupdf.open(THIS_DIR / "pasa.pdf") as file:
    for i in range(file.page_count):
        page = file.load_page(i)
        content[str(i + 1)] = page.get_text("text", sort=True), [
            ParsedImage(
                index=img_index,
                data=file.extract_image(img_info["xref"])["image"],
                info=img_info,
            )
            for img_index, img_info in enumerate(
                # Extract images all at once using get_image_info()
                page.get_image_info(hashes=True, xrefs=True)
            )
        ]
assert content["1"][1], "Expected image on page 1 to be present"
assert len(content["2"][1]) < 5, "Expected figure 1 to be read-in cohesively"
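
A possible stopgap for the fragmented figure, offered only as a sketch: merge the "bbox" rectangles that page.get_image_info() reports for each fragment into one enclosing region per figure. The helper name merge_bboxes and the 5-point tolerance are hypothetical, and it assumes that fragments of one figure touch or nearly touch; this has not been verified against this PDF.

```python
from collections.abc import Iterable

# (x0, y0, x1, y1) in page points, the same layout as get_image_info()["bbox"]
BBox = tuple[float, float, float, float]


def _near(a: BBox, b: BBox, tol: float) -> bool:
    """Return True if two rectangles overlap or lie within tol points."""
    return (
        a[0] - tol <= b[2]
        and b[0] - tol <= a[2]
        and a[1] - tol <= b[3]
        and b[1] - tol <= a[3]
    )


def merge_bboxes(bboxes: Iterable[BBox], tol: float = 5.0) -> list[BBox]:
    """Greedily union rectangles that touch or nearly touch (hypothetical helper)."""
    merged: list[BBox] = []
    for raw in bboxes:
        box: BBox = (raw[0], raw[1], raw[2], raw[3])
        absorbed = True
        while absorbed:  # a grown box may reach further neighbours, so rescan
            absorbed = False
            for other in merged:
                if _near(box, other, tol):
                    merged.remove(other)
                    box = (
                        min(box[0], other[0]),
                        min(box[1], other[1]),
                        max(box[2], other[2]),
                        max(box[3], other[3]),
                    )
                    absorbed = True
                    break
        merged.append(box)
    return merged


# Three adjacent fragments plus one distant image
fragments = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 12, 10, 22), (200, 200, 210, 210)]
regions = merge_bboxes(fragments)
```

Each merged region could then be rendered as a single image with page.get_pixmap(clip=pymupdf.Rect(region)). That rasterizes the region rather than recovering the original embedded image data, and it would not help with the missing page 1 icon if that icon is not reported by get_image_info() in the first place.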

PyMuPDF version

1.26.1

Operating system

macOS

Python version

3.13
