pypdfloader.ipynb image extraction not work #31777

AaaBin · 2025-06-30T03:48:51Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

from langchain_community.document_loaders import PyPDFLoader
from langchain_community.document_loaders.parsers import LLMImageBlobParser
from langchain_openai import ChatOpenAI

loader = PyPDFLoader(
    "./example_data/layout-parser-paper.pdf",
    mode="page",
    images_inner_format="markdown-img",
    images_parser=LLMImageBlobParser(model=ChatOpenAI(model="gpt-4o", max_tokens=1024)),
)
docs = loader.load()
print(docs[5].page_content)

Description

I'm currently working with the PyPDFLoader, following the demo notebook located at docs/docs/integrations/document_loaders/pypdfloader.ipynb.

What I'm doing:
I am running the provided Jupyter notebook. The notebook demonstrates loading a PDF document and extracting its contents, including images, using PyPDFLoader.

What I expect to happen:
Based on the output shown in the demo notebook, I expect the PyPDFLoader to successfully extract and make available the images embedded within the PDF document.

What is currently happening:
When I execute the notebook in my Colab environment, the PDF content is loaded without any errors. However, I am not observing any images being extracted or returned. There are no error messages or warnings indicating a failure in image extraction. This results in a discrepancy between my output and the expected output shown in the pypdfloader.ipynb notebook.
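To make "no images being extracted" concrete, a small helper can scan each page's text for image markers. This is only a sketch: it assumes that with images_inner_format="markdown-img" the parsed image text is embedded in page_content as a markdown image tag (something like `![...](#)`), and the name `pages_with_image_markers` is hypothetical, not part of LangChain.

```python
import re

# Assumption: image output appears in page_content as a markdown image tag,
# e.g. ![parsed text](#). Adjust the pattern if your output differs.
MARKDOWN_IMAGE = re.compile(r"!\[.*?\]\(.*?\)", re.DOTALL)


def pages_with_image_markers(page_texts):
    """Return {page_index: marker_count} for pages with at least one marker."""
    counts = {}
    for index, text in enumerate(page_texts):
        found = MARKDOWN_IMAGE.findall(text)
        if found:
            counts[index] = len(found)
    return counts


# Usage in the notebook, where docs comes from loader.load():
# print(pages_with_image_markers(doc.page_content for doc in docs))
```

An empty dict would match the behaviour described here: the text loads, but no page carries an image marker.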

System Info

System Information

OS: Linux
OS Version: #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
Python Version: 3.11.13 (main, Jun 4 2025, 08:57:29) [GCC 11.4.0]

Package Information

langchain_core: 0.3.66
langchain: 0.3.26
langchain_community: 0.3.26
langsmith: 0.4.1
langchain_openai: 0.3.27
langchain_text_splitters: 0.3.8

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.51: Installed. No version info available.
langchain-core<1.0.0,>=0.3.66: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-perplexity;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.8: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.26: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith>=0.1.125: Installed. No version info available.
langsmith>=0.1.17: Installed. No version info available.
langsmith>=0.3.45: Installed. No version info available.
numpy>=1.26.2;: Installed. No version info available.
numpy>=2.1.0;: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.86.0: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.18
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.7
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic>=2.7.4: Installed. No version info available.
pytest: 8.3.5
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 13.9.4
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

@AaaBin · 2025-06-30T03:54:24Z

dosubot[bot]
bot Jun 30, 2025

Hi @AaaBin! I'm Dosu and I’m helping the langchain team.

Image extraction with PyPDFLoader can silently fail if certain dependencies are missing or if the parser configuration isn't compatible with your PDF. Here are some steps to help troubleshoot:

1. Make sure you have all required dependencies installed. For LLMImageBlobParser, you need Pillow and pypdf. If you try RapidOCRBlobParser or TesseractBlobParser, you'll also need rapidocr-onnxruntime or pytesseract respectively. The demo notebook shows how to install these:
```
%pip install -qU rapidocr-onnxruntime
%pip install -qU pytesseract
```
2. Try swapping out LLMImageBlobParser for RapidOCRBlobParser() or TesseractBlobParser() to see if images are extracted with those parsers. This can help determine if the issue is parser-specific.
3. Double-check that your PDF actually contains extractable images and that you are using the same file as in the demo notebook.
4. If you still see no images and no errors, try enabling debug logging or adding print statements in the loader/parser code to surface any silent failures.
5. If none of the above work, try running a minimal script with just the image extraction logic and a known-good PDF to isolate the problem.
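The dependency check and the debug-logging step above could be sketched roughly as follows. It uses only the standard library. The module names in `CANDIDATE_MODULES` are assumptions based on the packages mentioned in this comment, and both function names are hypothetical. Whether the loader actually logs anything at DEBUG level is not confirmed here.

```python
import importlib.util
import logging
import warnings

# pip name -> import name. Assumption: these are the modules the parsers rely on.
CANDIDATE_MODULES = {
    "pypdf": "pypdf",
    "Pillow": "PIL",
    "rapidocr-onnxruntime": "rapidocr_onnxruntime",
    "pytesseract": "pytesseract",
}


def missing_packages(modules=CANDIDATE_MODULES):
    """Return the pip names whose import module cannot be found."""
    return [
        pip_name
        for pip_name, module in modules.items()
        if importlib.util.find_spec(module) is None
    ]


def enable_verbose_diagnostics():
    """Show DEBUG log records and every warning, in case a failure is only logged."""
    logging.basicConfig(level=logging.DEBUG, force=True)
    warnings.simplefilter("always")


print("Missing:", missing_packages() or "none")
```

Calling `enable_verbose_diagnostics()` before `loader.load()` should surface any log records or warnings the loader emits.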

Let me know what you find after trying these steps!


1 reply

AaaBin Jun 30, 2025
Author

rapidocr-onnxruntime and pytesseract are already installed in the notebook example and I've tried all parsers.
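With dependencies and parsers ruled out, one remaining check is whether pypdf itself reports embedded images in the file. The following is a minimal sketch, assuming a pypdf version recent enough to expose `page.images` on each page. The helper names `count_images_per_page` and `report` are hypothetical.

```python
def count_images_per_page(reader):
    """Return the number of embedded images reported for each page.

    `reader` is expected to be a pypdf.PdfReader; any object with `.pages`
    whose items expose `.images` works as well.
    """
    return [len(page.images) for page in reader.pages]


def report(pdf_path):
    """Print and return the per-page image counts for the PDF at pdf_path."""
    # Imported lazily so count_images_per_page has no hard dependency on pypdf.
    from pypdf import PdfReader

    counts = count_images_per_page(PdfReader(pdf_path))
    for number, count in enumerate(counts, start=1):
        print(f"page {number}: {count} image(s)")
    return counts


# In the notebook:
# report("./example_data/layout-parser-paper.pdf")
```

If every page reports zero images, the figures in the PDF may not be stored as extractable image objects (vector drawings, for example), which would point at the file rather than at the parsers.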
