Fix MarkdownElementNodeParser to extract code blocks #20840
Br1an67 wants to merge 1 commit into run-llama:main
Two issues prevented code blocks from being parsed:

1. Opening backtick fences (```) that weren't ending an existing code block were treated as text instead of starting a new code block.
2. The post-processing loop converted all non-table elements to 'text' type, erasing the 'code' type from correctly parsed code blocks.

Fix both by starting a code element on unmatched backtick fences and preserving the original element type in post-processing.
Hey @Br1an67 just as a heads up: you opened 5 PRs in (presumably) less than 1 hour. This behavior is borderline spamming and makes me think that there might be some crawling and AI automation behind all of these PRs (also judging from the commit messages).
    elif currentElement is not None and currentElement.type == "text":
        currentElement.element += "\n" + line
This elif branch was the root cause of the bug. When encountering an opening backtick fence (```), if there was already a text element being accumulated, this branch would append the fence line to the text element instead of starting a new code block. By removing it, opening fences now correctly fall through to the else branch, which saves the current element and starts a new code element. This is what allows code blocks to be properly extracted rather than being swallowed into surrounding text.
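The fence handling described in this comment can be illustrated with a minimal, self-contained sketch. It is not the library's actual `extract_elements()` implementation: the `Element` dataclass and the `extract_elements_sketch` function are hypothetical stand-ins, and tables and other element types are ignored. It only shows the idea that an opening fence closes whatever text is being accumulated and starts a `code` element.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical stand-in for the parser's element type."""
    type: str
    element: str


def extract_elements_sketch(text: str) -> List[Element]:
    """Split markdown into 'text' and 'code' elements (illustrative only)."""
    elements: List[Element] = []
    kind: Optional[str] = None  # type of the element being accumulated
    buffer: List[str] = []

    def flush() -> None:
        nonlocal kind, buffer
        # Keep code elements even when empty; skip whitespace-only text.
        if kind == "code" or (kind == "text" and any(l.strip() for l in buffer)):
            elements.append(Element(kind, "\n".join(buffer)))
        kind, buffer = None, []

    for line in text.split("\n"):
        if line.lstrip().startswith("```"):
            closing = kind == "code"
            flush()  # save the current element, whatever its type
            if not closing:
                kind = "code"  # opening fence: start a new code element
        else:
            if kind is None:
                kind = "text"
            buffer.append(line)

    flush()
    return elements


sample = "Intro\n```python\nprint(1)\n```\nOutro"
for el in extract_elements_sketch(sample):
    print(el.type, repr(el.element))
```

In this sketch the fence lines themselves are dropped, so the language identifier after the opening fence is not kept; whether the real parser retains it is not stated in this thread.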
Description
Fix `MarkdownElementNodeParser.extract_elements()` to properly extract code blocks (fenced with ```) as `code` type elements instead of merging them into surrounding text.

Two issues prevented code blocks from being extracted:

1. Parsing: Opening backtick fences that weren't ending an existing code block fell through to a branch that either appended the line to existing text or created a new text element, instead of starting a new code block.
2. Post-processing: After parsing, the post-processing loop (lines 269-275) converted all non-table elements to `type="text"`, erasing the `code` type from correctly parsed code elements. These were then merged with adjacent text in the consecutive-text merge step.

Fixes #19085
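To make the post-processing issue concrete, here is a rough sketch of the two behaviors. It is an assumption-laden illustration, not the code in the library: `Element`, `postprocess_sketch`, and the `preserve_type` flag are hypothetical names introduced only for this example. With `preserve_type=False` every non-table element is coerced to `text` before consecutive text elements are merged, which mirrors the behavior described as the bug; with `preserve_type=True` the original type is kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    """Hypothetical stand-in for the parser's element type."""
    type: str
    element: str


def postprocess_sketch(elements: List[Element], preserve_type: bool = True) -> List[Element]:
    """Optionally coerce types, then merge consecutive text elements."""
    merged: List[Element] = []
    for el in elements:
        if preserve_type or el.type == "table":
            el_type = el.type
        else:
            el_type = "text"  # old behavior: the 'code' type is erased here

        if el_type == "text" and merged and merged[-1].type == "text":
            # Consecutive-text merge step: fold into the previous text element.
            merged[-1] = Element("text", merged[-1].element + "\n" + el.element)
        else:
            merged.append(Element(el_type, el.element))
    return merged


parsed = [
    Element("text", "Intro"),
    Element("code", "print(1)"),
    Element("text", "Outro"),
]
print([e.type for e in postprocess_sketch(parsed, preserve_type=False)])
print([e.type for e in postprocess_sketch(parsed, preserve_type=True)])
```

With the type erased, the code element becomes indistinguishable from its neighbors and the merge step folds all three elements into one text element, which matches the symptom reported in the linked issue as described above.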
New Package?
N/A
Version Bump?
N/A — bug fix only.
Type of Change
How Has This Been Tested?
Added two tests:

- `test_code_block_extraction`: Verifies a simple fenced code block is extracted as a `code` element
- `test_code_block_with_language`: Verifies code blocks with language identifiers (```python) are handled

All 9 tests (7 existing + 2 new) pass.
Suggested Checklist: