Preserving the integrity of code examples in PDF > MD workflows #2247
Replies: 2 comments
-
This is a known limitation in Docling's PDF-to-Markdown workflow: code blocks inside numbered lists often lose their structure, and list markers or hierarchy aren't preserved correctly. The Markdown serializer currently prepends a dash to all list items from PDFs, regardless of the original marker, which leads to munged or flattened lists and separated code blocks in the output. Maintainers have acknowledged this as a bug, but it's not considered critical for downstream applications, so no official workaround exists yet. There are no configuration options to fix this at export time; addressing it would require changes to the chunk labeling or Markdown serialization logic. Some users have tried customizing the Markdown serializer or applying pre/post-processing scripts, but these are not documented as official solutions. For now, monitoring release notes for future improvements is recommended. More details and discussion are available here and here.
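As a rough illustration of the post-processing route mentioned above, here is a minimal Python sketch. It assumes the damaged Markdown contains lines shaped like `- 1. text`, that is, a dash placed in front of a numeric marker recovered from the PDF. That shape is a guess for illustration and is not confirmed in this thread. The function name `strip_dash_before_numbers` is hypothetical and not part of Docling.

```python
import re

# Assumed shape of the damaged output: "- 1. text" (a dash in front of the
# original numeric marker). This pattern is an assumption, not documented
# Docling behaviour; adjust it to match what your export actually contains.
_DASHED_NUMBER = re.compile(r"^(\s*)- (\d+[.)] )")


def strip_dash_before_numbers(markdown: str) -> str:
    """Drop the extra dash from lines such as '- 1. text', skipping fenced code."""
    fixed = []
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on every fence delimiter so code content is left untouched.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        if not in_fence:
            line = _DASHED_NUMBER.sub(r"\1\2", line)
        fixed.append(line)
    return "\n".join(fixed)
```

This only restores the numeric markers. It does not re-attach code blocks to the list items they belonged to, so it is a partial cleanup at best.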
-
I suggest running the following: `docling --to html_split_page --show-layout --enrich-code FILE`. It should:
-
Hi,
I'm looking for a way to bypass the issue (bug?) where the doc structure is munged when encountering separate lines of code in numbered lists. For example, I have this in PDF:
However, the md output is this:
Looks like a bug. If someone knows a workaround, LMK.
Thanks.