Suggested strategies for "skipping" incomplete/malformed markdown links? #1332
-
Hey remark community, I have a question on what strategies I could consider to skip incomplete/malformed markdown links. Given the following strings, markdown would render each of them (using the GitHub markdown preview to confirm):

1. `Hello [world](https://www.world.com)`
2. `Hello [world]`
3. `Hello [world](https://`

I want to understand the general direction on strategies one can apply to essentially skip rendering the link if it is malformed, e.g. scenarios 2 and 3 would only print `Hello `. I'm assuming one has to author some validator/transform, e.g. `const validateMarkdown = (text: string) => string;`:
```ts
validateMarkdown('Hello [world](https://www.world.com)'); // 'Hello [world](https://www.world.com)'
validateMarkdown('Hello [world]'); // 'Hello '
validateMarkdown('Hello [world](https://'); // 'Hello '

// in a similar way, `validateMarkdown` could be extended to handle HTML comments
validateMarkdown('Hello <!-- this is a comment -->'); // 'Hello <!-- this is a comment -->'
validateMarkdown('Hello <!-- this is a broken comment '); // 'Hello '
```

This question is mostly to understand the recommended approach from the community, and any help from relevant resources within the remark/micromark ecosystem to solve this problem will be greatly appreciated.
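A minimal sketch of what such a transform could look like, assuming `unified` with `remark-parse` and `unist-util-visit`. The cut-at-first-dangling-bracket heuristic is illustrative and only covers the link case, not the HTML comments:

```ts
import {unified} from 'unified'
import remarkParse from 'remark-parse'
import {visit} from 'unist-util-visit'
import type {Root, Text} from 'mdast'

// Heuristic: a '[' that survives parsing inside a plain text node was not
// consumed by a link construct, so the link syntax around it is incomplete.
const validateMarkdown = (text: string): string => {
  const tree = unified().use(remarkParse).parse(text) as Root
  let cutAt = text.length
  visit(tree, 'text', (node: Text) => {
    const bracket = node.value.indexOf('[')
    const offset = node.position?.start.offset
    if (bracket !== -1 && offset !== undefined) {
      // Cut the raw input at the earliest dangling bracket.
      cutAt = Math.min(cutAt, offset + bracket)
    }
  })
  return text.slice(0, cutAt)
}

validateMarkdown('Hello [world](https://www.world.com)') // 'Hello [world](https://www.world.com)'
validateMarkdown('Hello [world]') // 'Hello '
validateMarkdown('Hello [world](https://') // 'Hello '
```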
-
Hey!
The thing is, there is no malformed markdown.
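A quick way to confirm this, feeding the question's strings through `micromark` directly:

```ts
import {micromark} from 'micromark'

// Incomplete link syntax does not fail to parse; the brackets are
// simply treated as literal text.
micromark('Hello [world](https://www.world.com)')
// '<p>Hello <a href="https://www.world.com">world</a></p>'
micromark('Hello [world]')
// '<p>Hello [world]</p>'
micromark('Hello [world](https://')
// '<p>Hello [world](https://</p>'
```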
-
@chrisrzhou, great question. I have recently dealt with something similar. I think it's important to buffer the response and not expose the user to the incoming chunks directly. There are three options:
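Whatever the option, the shared piece is the buffer. A minimal sketch of that part, assuming an async iterable of chunks from the model and a UI-supplied `render` callback (both names are illustrative, not a real API):

```ts
async function renderStream(
  chunks: AsyncIterable<string>,
  render: (markdown: string) => void
): Promise<void> {
  let buffer = ''
  for await (const chunk of chunks) {
    buffer += chunk
    // Markdown has no incremental grammar, so re-render the whole
    // accumulated buffer as a full document on every chunk instead of
    // handing raw chunks to the user.
    render(buffer)
  }
}
```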
-
There is no "streaming" markdown, it is always a full document.
Make the LLM produce valid markdown.
You can shape the output with libraries like https://github.com/guidance-ai/guidance and https://github.com/outlines-dev/outlines to ensure the LLM will produce valid markdown.
-
This is a misunderstanding. The LLMs do produce valid markdown. The problem is how to render/pre-render markdown content that arrives chunk by chunk (streamed) and that naturally flips between valid and invalid before it is complete.

A service that returns the complete data instantaneously is the dream of every front-end developer :-) But while we're awake, we build workarounds for the real-world services :-) LLM vendors encourage developers to use the streaming mode. Being able to stream the answer lets them optimise memory consumption during output generation. And of course, letting the user start reading or listening to the answer earlier is important too: producing longer and more complicated answers takes tens of seconds, and not letting the user start earlier would waste their time. I believe you yourself enjoy streaming answers too, when you ask AI for help :-)

So the real-world task is to continuously render markdown chunks as they come. Thanks for pointing at guidance and outlines; they aren't specifically for Markdown, so I'll need to look at them more closely. So far, I found the following libraries and approaches:
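For example, one recurring idea is to "repair" the partial buffer just before each render by closing whatever construct is still open. A sketch for one such case, an unterminated code fence (the function and heuristic are illustrative, not from any particular library):

````ts
// Before rendering a partial buffer, close an unterminated code fence so
// the renderer does not treat the rest of the partial answer as code.
function repairPartial(buffer: string): string {
  const fenceCount = (buffer.match(/^```/gm) ?? []).length
  // An odd number of fence markers means a code block is still open.
  return fenceCount % 2 === 1 ? buffer + '\n```' : buffer
}
````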