PyMuPDF loader with extract_images=True and grayscale image #29586

@VelizarVESSELINOV

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

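from langchain_community.document_loaders import PyMuPDFLoader
pdf = PyMuPDFLoader(not_sharable_file_path, extract_images=True).load()  # the TypeError is raised while loading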

Error Message and Stack Trace (if applicable)

    return list(self._lazy_load(**kwargs))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xyz\Python312\Lib\site-packages\langchain_community\document_loaders\pdf.py", line 567, in _lazy_load
    yield from parser._lazy_parse(blob, text_kwargs=kwargs)
  File "xyz\Python312\Lib\site-packages\langchain_community\document_loaders\parsers\images.py", line 51, in lazy_parse
    img = Img.fromarray(numpy.load(buf))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "xyz\Python312\Lib\site-packages\PIL\Image.py", line 3315, in fromarray
    raise TypeError(msg) from e
TypeError: Cannot handle this data type: (1, 1, 1), |u1

Description

Current code:

                if blob.mimetype == "application/x-npy":
                    img = Img.fromarray(numpy.load(buf))

Suggested modification:

                if blob.mimetype == "application/x-npy":
                    array = numpy.load(buf)

                    # https://stackoverflow.com/questions/55319949/pil-typeerror-cannot-handle-this-data-type
                    if array.ndim == 3 and array.shape[2] == 1:  # Grayscale image
                        img = Img.fromarray(numpy.squeeze(array, axis=2), mode="L")
                    else:
                        img = Img.fromarray(array)
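
To check the suggestion outside LangChain, here is a minimal standalone sketch using only numpy and Pillow. The helper names `array_to_image` and `roundtrip` are hypothetical and not part of langchain_community; `roundtrip` only mimics the "application/x-npy" path by saving an array to a buffer and loading it back. The sketch assumes that Pillow infers mode "L" for a 2-D uint8 array, so it does not pass `mode` explicitly.

```python
import io

import numpy as np
from PIL import Image


def array_to_image(array: np.ndarray) -> Image.Image:
    """Hypothetical helper: build a PIL image from a decoded array.

    A grayscale image stored as (height, width, 1) is squeezed to
    (height, width) first, because Image.fromarray rejects a trailing
    single-channel axis for uint8 data.
    """
    if array.ndim == 3 and array.shape[2] == 1:  # grayscale with channel axis
        array = np.squeeze(array, axis=2)
    return Image.fromarray(array)


def roundtrip(array: np.ndarray) -> Image.Image:
    """Hypothetical helper: serialize to .npy bytes and decode again,
    similar to how an "application/x-npy" blob would be read."""
    buf = io.BytesIO()
    np.save(buf, array)
    buf.seek(0)
    return array_to_image(np.load(buf))


# A 4x6 grayscale image with an explicit channel axis, and an RGB image.
gray = np.zeros((4, 6, 1), dtype=np.uint8)
rgb = np.zeros((4, 6, 3), dtype=np.uint8)

gray_img = roundtrip(gray)
rgb_img = roundtrip(rgb)
print(gray_img.mode, gray_img.size)
print(rgb_img.mode, rgb_img.size)
```

RGB arrays go through the `else` path unchanged, so the modification should only affect the single-channel case that currently fails.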

System Info

System Information

OS: Windows
OS Version: 10.0.26100
Python Version: 3.12.7 (tags/v3.12.7:0b05ead, Oct 1 2024, 03:06:41) [MSC v.1941 64 bit (AMD64)]

Package Information

langchain_core: 0.3.33
langchain: 0.3.17
langchain_community: 0.3.16
langsmith: 0.1.135
langchain_aws: 0.2.12
langchain_experimental: 0.3.4
langchain_text_splitters: 0.3.5

Optional packages not installed

langserve

Other Dependencies

aiohttp: 3.10.10
async-timeout: Installed. No version info available.
boto3: 1.35.93
dataclasses-json: 0.6.7
httpx: 0.27.2
httpx-sse: 0.4.0
jsonpatch: 1.33
numpy: 1.26.4
orjson: 3.10.7
packaging: 24.1
pydantic: 2.9.2
pydantic-settings: 2.5.2
PyYAML: 6.0.2
requests: 2.32.3
requests-toolbelt: 1.0.0
SQLAlchemy: 2.0.37
tenacity: 8.5.0
typing-extensions: 4.12.2
