Markdown homogenization of training data #2510

@andreaskoepf

Description

Some of our datasets are markdown-formatted while others are plain text. Datasets using strict markdown (e.g. after conversion from HTML with a tool) escape special markdown characters, like _ -> \_ and * -> \*. Currently the model has to learn that 3 * 5 = means the same as 3 \* 5 =, and some messages refer to a variable name as a_b while others represent the same name as a\_b. If we want our model to always generate output in strictly escaped markdown, we need to escape all input text for training. This could, for example, be done by reading the text with a lax markdown parser and converting it into strict markdown.

A different approach would be to signal to the model in some way if the input is (strict) markdown or not.
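The signaling approach could be as simple as prepending a format-control token to each training message. A minimal sketch (the token names and the `tag_message` helper are illustrative, not part of any existing tokenizer or codebase):

```python
# Hypothetical format-tag scheme: prepend a control token so the model
# can distinguish strict-markdown messages from plain text. The tokens
# <|md|> and <|txt|> are made-up examples, not existing special tokens.
def tag_message(text: str, strict_markdown: bool) -> str:
    prefix = "<|md|>" if strict_markdown else "<|txt|>"
    return prefix + text

print(tag_message(r"3 \* 5 =", strict_markdown=True))   # <|md|>3 \* 5 =
print(tag_message("3 * 5 =", strict_markdown=False))    # <|txt|>3 * 5 =
```

The tokens would ideally be added as special tokens to the tokenizer so they can't be produced by ordinary text.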

Do you think this is a problem at all? LLMs can deal with multiple languages and spelling errors quite well.
How would you handle it?
Do you know a python markdown-parser we could use for the conversion/preprocessing?
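For the parser question, markdown-it-py and mistune are real Python markdown parsers that could serve as the lax reading step. As a baseline for the escaping direction (plain text -> strictly escaped markdown), here is a naive sketch; the regex and the `escape_markdown` helper are illustrative only and do not come from an existing tool, and a real pipeline would need parser-aware logic to avoid double-escaping text that is already markdown:

```python
import re

# Naive sketch: backslash-escape the punctuation characters that
# markdown treats specially, producing "strictly escaped" output
# similar to what html->markdown converters emit.
_SPECIALS = re.compile(r"([\\`*_{}\[\]()#+\-.!])")

def escape_markdown(text: str) -> str:
    """Escape markdown special characters in plain text."""
    return _SPECIALS.sub(r"\\\1", text)

print(escape_markdown("3 * 5 ="))  # 3 \* 5 =
print(escape_markdown("a_b"))      # a\_b
```

Applied blindly, this would turn already-escaped input like a\_b into a\\\_b, which is exactly why the lax-parse-then-re-render route seems safer than pure regex escaping.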
