How to use Azure OpenAI API to describe images in Docling? #1921

frogFred · 2025-07-10T00:56:17Z

frogFred
Jul 10, 2025

Hi all, I'm currently exploring how to use Azure OpenAI (AOAI) API to generate image descriptions for document processing in Docling. I aim to integrate this into a VLM pipeline that generates Markdown summaries with image annotations. However, I’ve encountered a problem: the image descriptions are not being generated, despite the API being called with the proper config. I'm unsure if I'm missing something in how Docling expects the VLM or AOAI response format.

What I'm doing

I'm using PictureDescriptionApiOptions with AOAI like this:

def create_picture_description_options(
    model: str,
    prompt: str,
    picture_area_threshold: float = 0.0
) -> PictureDescriptionApiOptions:
    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT").rstrip("/")
    url = f"{endpoint}/openai/deployments/{model}/chat/completions?api-version=2025-01-01-preview"
    headers = {
        "Content-Type": "application/json",
        "api-key": os.getenv("AOAI_API_KEY"),
    }
    params = {
        "messages": [
            {"role": "system", "content": prompt},
        ],
        "temperature": 0.7,
        "top_p": 0.95,
        "n": 1,
    }
    return PictureDescriptionApiOptions(
        url=url,
        headers=headers,
        params=params,
        timeout=300,
        prompt=prompt,
        picture_area_threshold=picture_area_threshold,
    )

And I register this with the PDF pipeline options (same for images):

InputFormat.PDF: PdfFormatOption(
    pipeline_options=PdfPipelineOptions(
        generate_picture_images=True,
        do_picture_description=True,
        enable_remote_services=True,
        accelerator_options=accelerator_options,
        picture_description_options=create_picture_description_options(
            model=MODEL_NAME,
            prompt=PROMPT
        )
    )
)

My questions

Is Docling expecting a specific response format or content structure from AOAI when describing images?

Are there examples or test cases for using external LLM services (like AOAI or OpenAI Vision) for image description?

Could this be a limitation of AOAI (not supporting image input yet in this pipeline), or is there a specific integration step I’m missing?

Notes

The deployment model is gpt-4.1-mini on Azure OpenAI.

I’ve confirmed that PictureDescriptionApiOptions is constructed and passed properly.

No errors are thrown, but no PictureDescriptionData is returned.

Any guidance or suggestions from the Docling community would be greatly appreciated!

frogFred · 2025-07-10T00:59:15Z

dosubot[bot]
bot Jul 10, 2025

Docling expects the external API's response for image descriptions to closely match the OpenAI API format, specifically returning a top-level field with the image description text (usually called "description"). If the AOAI response is nested, uses different field names, or omits this field, Docling may silently fail to produce PictureDescriptionData—even if the API call itself succeeds. This is a common cause of the issue you're seeing, and there are no explicit errors in these cases—enabling debug logging may help diagnose it further. See this discussion for details.

The core logic for parsing the API response is in the plugin files (e.g., picture_description_api.py). The response must be compatible with the internal PictureDescriptionData structure, which at minimum requires a description string. If your AOAI deployment (gpt-4.1-mini) does not support image input, or if the response format is not what Docling expects, you will not get any descriptions back—even if the API is called correctly.

There are no direct examples or test cases for AOAI or OpenAI Vision integration in the public docling-core repo, but the integration pattern is the same: ensure your API returns a JSON object with a top-level description field containing the generated text.

Checklist:

Confirm AOAI supports image input for your deployment/model.
Inspect the raw AOAI response to ensure it includes a top-level description field.
Set picture_area_threshold=0 to process all images, and enable debug logging to catch silent skips or parsing issues.
If the AOAI response format can't be changed, you may need to add a small adapter to map the response to the expected schema.

References:
OpenAI API format compatibility
Image processing thresholds and pipeline config
Pipeline options and expected response structure

To reply, just mention @dosu.

^{How did I do? Good | Irrelevant | Incorrect | Verbose | Hallucination | Report 🐛 | Other}

1 reply

frogFred Jul 14, 2025
Author

@dosu hello,

Thank you for your response and guidance! Regarding the response format requirements you mentioned, I have investigated and found that AOAI does not seem to receive any image data. The details are as follows:

Model support confirmed
I am using the deployment model gpt-4.1-mini, and I have verified that this model in Azure OpenAI indeed supports image input and can generate descriptions for images.

Picture area threshold correctly set
I have set picture_area_threshold to 0 to ensure that all images are sent through the description process, with none being filtered out for having too small an area.

No description field in the response
In the raw AOAI response, there is no field named description (or any equivalent). However, I observed that when using Ollama, image descriptions are generated correctly—even though its API response does not include a description field, Docling still successfully parses it into PictureDescriptionData. From this, I infer that description may not be the only valid field name, as long as the response format can map to the structure expected by PictureDescriptionData.

dolfim-ibm · 2025-07-10T07:45:40Z

dolfim-ibm
Jul 10, 2025
Maintainer

The setup looks correct. How are you checking if the description is populated or not? Are you looking at the content of the annotations in the DoclingDocument, doing an export to markdown or others?

There is indeed the parameter picture_area_threshold which, by default, will only invoke the picture description if the image has more than 5% of the page surface.

1 reply

frogFred Jul 10, 2025
Author

Thanks for the follow-up!

Yes — I’m currently using the MarkdownDocSerializer with a custom AnnotationPictureSerializer like this:

serializer = MarkdownDocSerializer(
    doc=doc,
    picture_serializer=AnnotationPictureSerializer(),
    params=MarkdownParams(
        image_mode=ImageRefMode.EMBEDDED,
    ),
)
ser_result = serializer.serialize()
ser_text = ser_result.text

The Markdown output is where I’m checking for the generated descriptions (e.g., looking for ).

As for the picture_area_threshold, I’ve explicitly set it to 0.0 in create_picture_description_options() to ensure that even small images are processed. I've double-checked this part and confirmed it's correctly passed to the pipeline.

Let me know if there are any other checkpoints or debug signals I should look into to confirm whether the descriptions are being attempted but not returned, or simply skipped.

Thanks again for the guidance!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

How to use Azure OpenAI API to describe images in Docling? #1921

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 2 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

How to use Azure OpenAI API to describe images in Docling? #1921

Uh oh!

frogFred Jul 10, 2025

What I'm doing

My questions

Notes

Replies: 2 comments · 2 replies

Uh oh!

dosubot[bot] bot Jul 10, 2025

Uh oh!

frogFred Jul 14, 2025 Author

Uh oh!

dolfim-ibm Jul 10, 2025 Maintainer

Uh oh!

frogFred Jul 10, 2025 Author

frogFred
Jul 10, 2025

Replies: 2 comments 2 replies

dosubot[bot]
bot Jul 10, 2025

frogFred Jul 14, 2025
Author

dolfim-ibm
Jul 10, 2025
Maintainer

frogFred Jul 10, 2025
Author