Skip to content

Greedy Regex in strip_markdown_fences() #4397

@sksvineeth

Description

@sksvineeth

Initial Checks

Description

The .* quantifier is greedy — with re.DOTALL, it matches from the first { all the way to the last } in the entire text, crossing past the closing ``` fence.

Example

An LLM returns: ```json
{"result": "pass"}
This conforms to the schema {"type": "object"}

Expected extraction: {"result": "pass"}

Actual extraction: {"result": "pass"}\n ``` \nThis conforms to the schema {"type": "object"} the greedy `.*` matched from the first `{` to the last `}` in the entire string, gobbling up the closing fence and
everything after it.

This produces malformed JSON that fails downstream validation, potentially triggering unnecessary retries or errors.

The bug only manifests when there's a } character after the closing ``` fence. That's row 1 — the only case where broken and expected differ:

Input: json\n{"a": 1}\n + Context: {"b": 2}
Current (broken): {"a": 1}\n``` \nContext: {"b": 2}
Expected: {"a": 1}

Input: json\n{"nested": {"k": "v"}}\n
Current (broken): {"nested": {"k": "v"}}
Expected:: {"nested": {"k": "v"}}

Input: ```json\n{\n "foo": "bar"\n} (no closing fence)
Current (broken): {\n "foo": "bar"\n}
Expected:: {\n "foo": "bar"\n}

Minimal, Reproducible Example

from pydantic_ai._utils import strip_markdown_fences                        
                                           
  # LLM returns JSON in a markdown fence, followed by commentary containing   
  braces                                                                      
  text = '\n{"result": "pass"}\n\nThis matches schema {"type":
  "object"}'

  extracted = strip_markdown_fences(text)
  print(repr(extracted))
  # Actual:   '{"result": "pass"}\n\nThis matches schema {"type":
  "object"}'
  # Expected: '{"result": "pass"}'

  The greedy .* in the regex r'(?:\w+)?\n(\{.*\})' matches from the first {
   to the **last** } in the entire string, crossing past the closing  fence
   and capturing the commentary text as part of the "JSON".

Logfire Trace

N/A — this bug is in the strip_markdown_fences() regex utility function (_utils.py:517), not in a model request or agent run. It can be reproduced with a direct function call without any LLM interaction.

Python, Pydantic AI & LLM client version

  • Python: 3.12+
  • Pydantic AI: main branch (latest)
  • LLM provider SDK: Any (bug is in utility function, not provider-specific)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugReport that something isn't working, or PR implementing a fix

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions