Markdown homogenization of training data #2510

@andreaskoepf

Description

Some of our datasets are markdown-formatted while others are plain text. Datasets using strict markdown (e.g. after conversion from HTML with a tool) escape special markdown characters, like _ -> \_ and * -> \*. Currently the model has to learn that 3 * 5 = means the same as 3 \* 5 =, and some messages refer to a variable name as a_b while others represent the same name as a\_b. If we want our model to always generate output in strictly escaped markdown, we need to escape all input text for training. This could, for example, be done by reading the text with a lax markdown parser and converting it into strict markdown.

A different approach would be to signal to the model in some way if the input is (strict) markdown or not.
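The signaling approach could be as simple as prepending a format-control token to each training message. A minimal sketch (the token names and the `tag_message` helper are illustrative, not part of any existing tokenizer or codebase):

```python
# Hypothetical format-tag scheme: prepend a control token so the model
# can distinguish strict-markdown messages from plain text. The tokens
# <|md|> and <|txt|> are made-up examples, not existing special tokens.
def tag_message(text: str, strict_markdown: bool) -> str:
    prefix = "<|md|>" if strict_markdown else "<|txt|>"
    return prefix + text

print(tag_message(r"3 \* 5 =", strict_markdown=True))   # <|md|>3 \* 5 =
print(tag_message("3 * 5 =", strict_markdown=False))    # <|txt|>3 * 5 =
```

The tokens would ideally be added as special tokens to the tokenizer so they can't be produced by ordinary text.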

Do you think this is a problem at all? LLMs can deal with multiple languages and spelling errors quite well.
How would you handle it?
Do you know a python markdown-parser we could use for the conversion/preprocessing?
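For the parser question, markdown-it-py and mistune are real Python markdown parsers that could serve as the lax reading step. As a baseline for the escaping direction (plain text -> strictly escaped markdown), here is a naive sketch; the regex and the `escape_markdown` helper are illustrative only and do not come from an existing tool, and a real pipeline would need parser-aware logic to avoid double-escaping text that is already markdown:

```python
import re

# Naive sketch: backslash-escape the punctuation characters that
# markdown treats specially, producing "strictly escaped" output
# similar to what html->markdown converters emit.
_SPECIALS = re.compile(r"([\\`*_{}\[\]()#+\-.!])")

def escape_markdown(text: str) -> str:
    """Escape markdown special characters in plain text."""
    return _SPECIALS.sub(r"\\\1", text)

print(escape_markdown("3 * 5 ="))  # 3 \* 5 =
print(escape_markdown("a_b"))      # a\_b
```

Applied blindly, this would turn already-escaped input like a\_b into a\\\_b, which is exactly why the lax-parse-then-re-render route seems safer than pure regex escaping.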
