Claude reads PDF using Visual Mode even if citations are disabled #1072

@cedric-fauth

Description

I wanted Anthropic to handle PDF parsing for my agent, so I followed the official docs.

I wanted to use text extraction only ("Converse Document Chat (Original mode - Text extraction only)"), which is used automatically when citations are not enabled, instead of visual mode ("Claude PDF Chat (New mode - Full visual understanding)").

So I disabled citations in each document object. This is what my trace looks like:

{
    "role": "user",
    "content": [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": "@@@langfuseMedia:type=application/pdf|id=HX8_5o4Ap6brS1uXGy89pK|source=base64_data_uri@@@"
            },
            "title": "root/DE/nature/nature_park/natureparkdata.pdf:1-1",
            "context": "You have access to this file until a system-reminder tells you otherwise.",
            "citations": {
                "enabled": false
            }
        },
        {
            "type": "text",
            "text": "what is this pdf about? use text extraction only. not visual mode."
        }
    ]
}
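
A message of this shape can be built roughly as follows. This is a minimal sketch, not the actual agent code: the helper name `build_pdf_message`, the file title, and the placeholder bytes are hypothetical, and the optional `context` field is left out for brevity. It only assembles the payload and does not call any API.

```python
import base64
import json


def build_pdf_message(pdf_bytes: bytes, title: str, question: str) -> dict:
    """Build a user message with one base64 PDF document block, citations disabled."""
    document_block = {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            # The API expects the raw PDF bytes as a base64 string.
            "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
        },
        "title": title,
        # Citations are switched off per document, as in the trace above.
        "citations": {"enabled": False},
    }
    return {
        "role": "user",
        "content": [document_block, {"type": "text", "text": question}],
    }


if __name__ == "__main__":
    # Placeholder bytes, not a valid PDF; replace with the real file contents.
    message = build_pdf_message(
        b"%PDF-1.4 placeholder",
        "example.pdf",
        "what is this pdf about?",
    )
    print(json.dumps(message["content"][0]["citations"]))
```

The point of the sketch is that `citations.enabled` is the only switch set on the document block; nothing else in the request selects between text extraction and visual processing.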

However, with a completely scanned PDF (just one image, with no OCR text layer), I get this response:

The user is asking me to extract text from a PDF file. Looking at the document content provided, I can see the text content that was already extracted from the PDF

The trace shows that all the text was extracted from the image in the PDF.

[
    {
        "id": "toolu_01CW9ShJRYNpPzR7424jPjHv",
        "input": {},
        "name": "think",
        "type": "tool_use",
        "index": 0,
        "partial_json": {
            "thought": "The user is asking me to extract text from a PDF file. Looking at the document content provided, I can see the text content that was already extracted from the PDF. Let me analyze what's in there:\n\nFrom the PDF content:\n- Title: \"Map Preview\"\n- Text: \"No files available\"\n"
        }
    }
]

It seems like Anthropic extracted all the data from the image using visual mode. This also matches the token usage, which is quite high (too high for just reading text). I would like to know how I can prevent this from happening. It looks like disabling citations doesn't do the trick.
