Fix MarkdownElementNodeParser to extract code blocks #20840

Open
Br1an67 wants to merge 1 commit into run-llama:main from Br1an67:fix/issue-19085-code-block-extraction

Br1an67 commented Mar 1, 2026

Description

Fix MarkdownElementNodeParser.extract_elements() to properly extract code blocks (fenced with ```) as code type elements instead of merging them into surrounding text.

Two issues prevented code blocks from being extracted:

  1. Parsing: Opening backtick fences that weren't ending an existing code block fell through to a branch that either appended the line to existing text or created a new text element, instead of starting a new code block.

  2. Post-processing: After parsing, the post-processing loop (lines 269-275) converted all non-table elements to type="text", erasing the code type from correctly parsed code elements. These were then merged with adjacent text in the consecutive-text merge step.
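
To make the second issue concrete, here is a minimal, self-contained sketch of a retype-then-merge pass. It is not the library's code: `Element`, `postprocess`, and the `preserve_type` flag are hypothetical names for illustration, and table handling is reduced to "tables keep their type".

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for the parser's element type."""
    type: str     # "text", "code", or "table"
    element: str  # raw content


def postprocess(elements: List[Element], preserve_type: bool) -> List[Element]:
    # Step 1: retype. With preserve_type=False, every non-table element
    # is forced to "text", which is the behavior described in issue 2.
    retyped = [
        el if (el.type == "table" or preserve_type) else replace(el, type="text")
        for el in elements
    ]
    # Step 2: merge runs of consecutive text elements into one element.
    merged: List[Element] = []
    for el in retyped:
        if merged and merged[-1].type == "text" and el.type == "text":
            merged[-1] = replace(
                merged[-1], element=merged[-1].element + "\n" + el.element
            )
        else:
            merged.append(el)
    return merged


parsed = [
    Element("text", "Intro"),
    Element("code", "print('hi')"),
    Element("text", "Outro"),
]
old = postprocess(parsed, preserve_type=False)  # code is swallowed into text
new = postprocess(parsed, preserve_type=True)   # code survives as its own element
```

With `preserve_type=False` the three parsed elements collapse into a single text element; with `preserve_type=True` the code element stays separate, so the merge step only joins genuinely adjacent text.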

Fixes #19085

New Package?

N/A

Version Bump?

N/A — bug fix only.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

Added two tests:

  • test_code_block_extraction: Verifies a simple fenced code block is extracted as a code element
  • test_code_block_with_language: Verifies code blocks with language identifiers (```python) are handled

All 9 tests (7 existing + 2 new) pass.
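
For readers without the diff at hand, the two tests could look roughly like the sketch below. The test bodies are assumptions, not the PR's actual tests, and `extract_elements` here is a hypothetical regex-based stand-in (returning `(type, content)` tuples) so the snippet runs without llama_index installed.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this snippet

# Opening fence with optional language tag, lazily matched body, closing fence.
_CODE_BLOCK = re.compile(
    rf"^{FENCE}[^\n]*\n(.*?)\n{FENCE}[ \t]*$", re.DOTALL | re.MULTILINE
)


def extract_elements(text: str) -> List[Tuple[str, str]]:
    """Hypothetical stand-in: split markdown into ("text"|"code", content) pairs."""
    elements: List[Tuple[str, str]] = []
    pos = 0
    for match in _CODE_BLOCK.finditer(text):
        before = text[pos:match.start()].strip("\n")
        if before:
            elements.append(("text", before))
        elements.append(("code", match.group(1)))
        pos = match.end()
    rest = text[pos:].strip("\n")
    if rest:
        elements.append(("text", rest))
    return elements


def test_code_block_extraction() -> None:
    md = f"Some text\n{FENCE}\nx = 1\n{FENCE}\nMore text"
    assert ("code", "x = 1") in extract_elements(md)


def test_code_block_with_language() -> None:
    md = f"Some text\n{FENCE}python\nprint('hi')\n{FENCE}\nMore text"
    kinds = [kind for kind, _ in extract_elements(md)]
    assert kinds == ["text", "code", "text"]
```

Against the real parser, the same assertions would check the `type` attribute of the returned elements rather than tuple entries.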

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Two issues prevented code blocks from being parsed:

1. Opening backtick fences (```) that weren't ending an existing code
   block were treated as text instead of starting a new code block.

2. The post-processing loop converted all non-table elements to 'text'
   type, erasing the 'code' type from correctly parsed code blocks.

Fix both by starting a code element on unmatched backtick fences and
preserving the original element type in post-processing.

dosubot (bot) added the size:M label (This PR changes 30-99 lines, ignoring generated files.) on Mar 1, 2026

AstraBert (Member) commented Mar 2, 2026

Hey @Br1an67, just a heads up: you opened 5 PRs in (presumably) less than 1 hour. This behavior is borderline spamming and makes me think that there might be some crawling and AI automation behind all of these PRs (also judging from the commit messages).
We are a community of human developers and maintainers and, while AI-assisted code is always welcome, human oversight is fundamental: crawling and spamming PRs is not acceptable behavior, and consequences might follow if it continues on your side.

Comment on lines -173 to -174
elif currentElement is not None and currentElement.type == "text":
    currentElement.element += "\n" + line
Member

Why was this eliminated?

Author

This elif branch was the root cause of the bug. When encountering an opening backtick fence (```), if there was already a text element being accumulated, this branch would append the fence line to the text element instead of starting a new code block. By removing it, opening fences now correctly fall through to the else branch, which saves the current element and starts a new code element. This is what allows code blocks to be properly extracted rather than being swallowed into surrounding text.
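
The effect of removing that branch can be sketched with a simplified fence-handling loop. This is an illustration under stated assumptions, not the parser's actual code: `Element`, `parse`, and the `keep_removed_branch` flag are hypothetical, and only fences and plain lines are handled.

```python
from dataclasses import dataclass
from typing import List, Optional

FENCE = "`" * 3  # built programmatically to avoid a literal fence in this snippet


@dataclass
class Element:
    """Hypothetical stand-in for the parser's element type."""
    type: str     # "text" or "code"
    element: str  # accumulated content


def parse(text: str, keep_removed_branch: bool) -> List[Element]:
    elements: List[Element] = []
    current: Optional[Element] = None
    for line in text.split("\n"):
        if line.startswith(FENCE):
            if current is not None and current.type == "code":
                # Closing fence: finish the open code element.
                elements.append(current)
                current = None
            elif keep_removed_branch and current is not None and current.type == "text":
                # The removed branch: the fence line is appended to the open text.
                current.element += "\n" + line
            else:
                # Opening fence: save any open element and start a code element.
                if current is not None:
                    elements.append(current)
                current = Element(type="code", element="")
        elif current is not None:
            # Ordinary line: extend whatever element is open.
            current.element += ("\n" if current.element else "") + line
        else:
            current = Element(type="text", element=line)
    if current is not None:
        elements.append(current)
    return elements


doc = f"Intro text\n{FENCE}python\nprint('hi')\n{FENCE}\nOutro"
before = parse(doc, keep_removed_branch=True)   # fence swallowed into text
after = parse(doc, keep_removed_branch=False)   # fence opens a code element
```

With the branch kept, the whole document ends up as one text element, because every fence arrives while a text element is open. With it removed, the same input yields text, code, text.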

Development

Successfully merging this pull request may close these issues.

[Bug]: MarkdownElementParser does not extract code blocks
